[MDEV-6115] window functions as in the SQL standard Created: 2014-04-15  Updated: 2019-03-04  Resolved: 2019-03-04

Status: Closed
Project: MariaDB Server
Component/s: Optimizer - Window functions
Fix Version/s: 10.2.2

Type: Task Priority: Major
Reporter: Sergei Golubchik Assignee: Igor Babaev
Resolution: Fixed Votes: 17
Labels: optimizer

Issue Links:
Blocks
is blocked by MDEV-8646 Re-engineer the code for post-join op... Closed
is blocked by MDEV-9543 Parsing, name resolution and optimiza... Closed
PartOf
includes MDEV-9198 Window functions: parser, error chec... Closed
includes MDEV-9204 Window functions: locate the place wh... Closed
includes MDEV-9525 Window functions: Support for aggrega... Open
includes MDEV-9526 Compute Aggregate functions as window... Closed
includes MDEV-9676 RANGE-type frames for window functions Closed
includes MDEV-9719 Query with window function crashes wi... Closed
includes MDEV-9724 Window functions: Frame Exclusion sup... Closed
includes MDEV-9727 Window functions: datetime arithmetic... Open
includes MDEV-9736 Window functions: multiple cursors to... Open
includes MDEV-9740 Window functions: catch invalid windo... Closed
includes MDEV-9741 Window functions: support being in a ... Stalled
includes MDEV-9746 Window functions: CUME_DIST function Closed
includes MDEV-9753 Window functions: reject window frame... Closed
includes MDEV-9754 Window functions: assertion failure i... Closed
includes MDEV-9755 Buildbot shows a crash in JOIN::make_... Closed
includes MDEV-9780 Window functions: interplay between w... Closed
includes MDEV-9787 Window functions: HAVING and GROUP BY Closed
includes MDEV-9841 Window functions: embedded server fai... Closed
includes MDEV-9847 Window functions: crash with big_tabl... Closed
includes MDEV-9848 Window functions: reuse sorting and/o... Open
includes MDEV-9877 Window functions: wrong sort criteria... Closed
includes MDEV-9892 Window functions: add a counter Closed
includes MDEV-9897 Window functions: crash when ORDER BY... Closed
includes MDEV-9911 NTILE must return an error when param... Closed
includes MDEV-9935 Window functions: assertion failure w... Closed
includes MDEV-10023 Window functions: big window frame ha... Closed
includes MDEV-17191 Named WINDOW in window functions Closed
Relates
relates to MDEV-8091 Simple window functions Closed
relates to MDEV-9894 Assertion `0' failed in Window_func_r... Closed
relates to MDEV-9895 Assertion `n_rows > 0' failed in Fram... Closed
relates to MDEV-5535 cannot reopen temporary table Closed
relates to MDEV-10855 Window functions: condition pushdown ... Closed

 Description   

Spec draft

Place in the optimizer

Window functions are evaluated on the result set after the WHERE, GROUP BY and HAVING have been applied.

  • If we are operating after a temptable-based grouping operation, we can read its output (and apply HAVING on the fly)
  • if we are operating after a Join operation, or non-temptable based grouping, we will need to store the output of previous operation in a temptable and then work from there.

Basic idea about operation

Let's start with one window function:

func_name(arg) OVER (PARTITION BY part_expr  ORDER BY order_expr  $window_spec)

Window function is evaluated for each row of the result set. It is a function of row's partition, we need to look at the rows of the resultset that are ordered according to order_expr. $window_spec specifies at which rows we need to look. Sometimes it's certain preceding rows, sometimes it's certain following rows (and sometimes it's both?)

Execution can be done as follows:

  • sort resultset rows by (part_expr, order_expr).
  • scan each partition (either in reverse or forward order) and compute the value of window function on the fly.

Unresolved questions:

  • How to handle the context. If a window function depends on preceding and following rows, how do we keep track of those?
  • Whether we should use files or temp.tables for storing the sorted resultset (we will always have to store it, because filesort doesn't support pull-based reading of its resultset)

Multiple window functions

A query may use multiple window functions which use different PARTITION BY / ORDER BY expressions:

SELECT 
   window_func1 OVER (PARTITION BY part_expr1 ORDER BY order_expr1)
   window_func2 OVER (PARTITION BY part_expr2 ORDER BY order_expr2)
FROM ... WHERE ... etc

The query should be evaluated as follows:

  • sort the result according to part_expr1/order_expr1, compute values of window_func1
  • sort the result according to part_expr2/order_expr2, compute values of window_func2

Links



 Comments   
Comment by tmjee [ 2014-09-10 ]

Hi guys,

Any ideas if this will make it into 10.2.0 and also roughly when is the release date for 10.2.0?

Cheers

Comment by Sergei Petrunia [ 2014-09-10 ]

Currently, target_version= "10.2" for this task means "not 10.1". At the moment, there are no specific plans for 10.2, and nobody has committed to getting this feature done for 10.2.

Thanks for showing your interest (tasks that people are interested in are more likely to be done).

Comment by Sergei Petrunia [ 2015-05-04 ]

A subset of Windowing Functions is being worked on in MDEV-8091.

Comment by Sergei Petrunia [ 2015-10-15 ]

There is a tree for this feature, https://github.com/MariaDB/server/tree/bb-10.1-mdev6115

Comment by Sergei Petrunia [ 2015-11-26 ]

Notes from the discussion: it looks like MDEV-5535 could be used in this feature. It would be useful to have multiple cursors against the sorted table.
If I recall correctly, the cursors would point at window frame start, window frame end, and the current row.

Comment by Sergei Petrunia [ 2015-11-26 ]

MDEV-8646 is about porting the code from MySQL-5.6 that lets the query plan structures to include sorting/aggregation steps. These will be helpful for Windowing functions; Igor is working on that task.

Comment by Michael Widenius [ 2016-02-02 ]

Here follows a simple approach to implement windowing functions without having to touch much code in sql_select.cc. Almost all code is in Items with some refactoring of init_read_record() so that we can have several filesorts open for the same table.

During parsing, create window function (WF) items and put them also in a
separate list.
During fix fields, regard WF as SUM functions in HAVING (as they will
access columns in the result).
During prepare, force usage of temporary table if WF exists

  • Temporary tables created will automaticly store values used by WF functions
    and place to hold the WF result itself.
    Execute normally until a temporary table with the result
    (not necessarily ordered) is created.
    For each WF item
  • Execute a filesort in the order specified by WF
  • Store filesort info in WF item
    For each WF item
  • Reset item (item->clear)
  • Set row_to_update to first row
  • Loop over all rows according to it's sort order
  • If new window (group) changes:
  • Loop over all rows between row_to_update to current_row
  • Subtract row from top frame
  • Store in row_to_update
  • Move row_to_update to next row
  • reset item to initial value (item->clear)
  • Add current row information to item (item->add())
  • If exists relevant row at top frame of active window, subtract it
    (item->sub())
  • If exists row to be updated, store current value in row and move to
    next row. This is not the same as current row as a row may be updated
    with values from rows before current row and rows after current row.
  • Loop over all rows between row_to_update to current_row
  • Subtract row from top frame
  • Store in row_to_update
  • Move row_to_update to next row

The above works for avergage, sum, count etc.
For min/max the item would have to loop over all rows in the frame if
value of the subtracted row is smaller/bigger than the current value.

Benefits of the above solution:

  • Most work is done in items
  • Item functions will be realtively simple (clear, add, dec, loop)
  • No notable changes in optimizer, just a few hooks
  • Execution cost:
  • Always require temp table
  • 1 Scan over result set per WF sort order
    (It's trivial to combine all WF items with the same order in the above
    loop)
  • Some WF functions require two scans (to calculate number of rows in window),
    but that would be trivial to add in above code.
Comment by Sergei Petrunia [ 2016-02-03 ]

cvicentiu 's current code is in https://github.com/MariaDB/server/tree/10.2-window_simple . Note that the branch was rebased on 10.2 (so ORIGINAL WINDOW FUNCS commits were modified)

The first thing that doesn't work is: temp.table is created but it doesn't have fields for window function values.
My guess is: create_tmp_table code assumes window functions are a kind of aggregate functions, and so don't need to be in the temp. table.. I'll check this

Comment by Sergei Petrunia [ 2016-02-03 ]

Ok, window functions do not get into the temp.table because they are skipped here:

  #0  create_tmp_field 
  #1  0x0000555555ad6abd in create_tmp_table
  #2  0x0000555555ab2f77 in JOIN::init_execution 
  #3  0x0000555555ab5318 in JOIN::exec_inner 
  #4  0x0000555555ab4503 in JOIN::exec 

create_tmp_field has this switch:

  switch (type) {
  case Item::SUM_FUNC_ITEM:
  ...
  default:					// Dosen't have to be stored
    return 0;
  }

and Item::WINDOW_FUNC_ITEM ends up right in the default:.

Comment by Sergei Petrunia [ 2016-02-03 ]

Another thing to research:
when I run a SELECT with window functions, create_tmp_table is called with select_options that doesnt include TMP_TABLE_ALL_COLUMNS flag.
Hmm this seems to be ok, if I set that flag in gdb the field is still not created for due to the switch in create_tmp_field.

Comment by Sergei Petrunia [ 2016-02-03 ]

Ok this patch makes the temporary table to create fields for window functions:

diff --git a/sql/sql_select.cc b/sql/sql_select.cc
index 94ed74c..43dc9bf 100644
--- a/sql/sql_select.cc
+++ b/sql/sql_select.cc
@@ -16138,6 +16138,7 @@ Field *create_tmp_field(THD *thd, TABLE *table,Item *item, Item::Type type,
   case Item::NULL_ITEM:
   case Item::VARBIN_ITEM:
   case Item::CACHE_ITEM:
+  case Item::WINDOW_FUNC_ITEM: // psergey-winfunc:
   case Item::EXPR_CACHE_ITEM:
     if (make_copy_field)
     {

However, the temp.table field for "row_number() over (partition by a)" is of
type Field_double.
This is because item_windowfunc.h defines Item_window_func::field_type to
return MYSQL_TYPE_DOUBLE.

Comment by Sergei Petrunia [ 2016-02-04 ]

Looking at whether the

> Temporary tables created will automaticly store values used by WF functions and place to hold the WF result itself.

part of the spec is satisfied.

When I try this query:

select  row_number() over (partition by a,b order by x) from t1;

the temporary table has fields:

(gdb) p exec_tmp_table1->s->fields
  $81 = 4
(gdb) p exec_tmp_table1->field[0].field_name
  $82 = 0x7fff5400d555 "x"
(gdb) p exec_tmp_table1->field[1].field_name
  $83 = 0x7fff5400d553 "b"
(gdb) p exec_tmp_table1->field[2].field_name
  $84 = 0x7fff5400d551 "a"
(gdb) p exec_tmp_table1->field[3].field_name
  $85 = 0x7fff54007b08 "row_number() over (partition by a,b order by x)"

OK. However, when I try this:

select  row_number() over (partition by a+b order by x) from t1;

the temporary table only has:

(gdb) p exec_tmp_table1->s->fields
  $88 = 1
(gdb) p exec_tmp_table1->field[0].field_name
  $89 = 0x7fff54007b80 "row_number() over (partition by a+b order by x)"

This is incorrect.

Comment by Sergei Petrunia [ 2016-02-04 ]

Questions that need to be resolved:

== sort_order construction ==
create_sort_index() uses make_unireg_sortorder() to construct sort ordering
definition for filesort. Window functions feature uses its own code, which does a slightly different thing. Is there any diference?

== PARTITION BY group detection ==
When reading filesort results, what is the best way to tell when one PARTITION BY group ended and the next one began? There are two options:
1. Somehow compute Item expressions and compare them like test_if_group_changed does.

2. filesort result actually consists of { sort_key=

{parititon_sort_key, orderby_sort_key}

, rowid} pairs (right?) We could compare the sort_key parts. The problem is being able to tell where the partition_sort_key ends (we need to compare partition_sort_key only). Are they variable-length or they always have the same length?

Comment by Sergei Petrunia [ 2016-02-04 ]

Discovered a bug in the current code: st_select_lex::window_specs is not cleaned up across queries.

If you run queries with window functions repeatedly, you will eventually fail this assert:

  DBUG_ASSERT(all_fields.elements <=
              thd->lex->current_select->ref_pointer_array_size);

Which also raises a question: are window functions counted when computing the upper bound for ref_pointer_array_size? (And will this question be relevant after MDEV-8646 is pushed?)

Comment by Sergei Petrunia [ 2016-02-04 ]

Ok the problem with temp.table not having all required fields is gone after I fixed the bug mentioned in the previous comment.

select a, row_number() over (partition by a+b order by x) from t1;

(gdb) p exec_tmp_table1->s->fields
  $59 = 4
(gdb) p exec_tmp_table1->field[0]
  $60 = (Field_string *) 0x7fff58017118
(gdb) p exec_tmp_table1->field[1]
  $61 = (Field_longlong *) 0x7fff580171e0
(gdb) p exec_tmp_table1->field[2]
  $62 = (Field_long *) 0x7fff580172a0
(gdb) p exec_tmp_table1->field[3]
  $63 = (Field_double *) 0x7fff58017360

(gdb) p exec_tmp_table1->field[0]->field_name
  $64 = 0x7fff5800d555 "x"
(gdb) p exec_tmp_table1->field[1]->field_name
  $65 = 0x0
(gdb) p exec_tmp_table1->field[2]->field_name
  $66 = 0x7fff5800d551 "a"
(gdb) p exec_tmp_table1->field[3]->field_name
  $67 = 0x7fff58007c90 "row_number() over (partition by a+b order by x)"

field[1] corresponds to Item_func_plus.

Comment by Sergei Petrunia [ 2016-02-04 ]

filesort result actually consists of {sort_key= {parititon_sort_key, orderby_sort_key} , rowid} pairs (right?)

Nope. Tried a small example (too small to require merges), and filesort() called save_index() to produce an array with result data. It seems it wrote only rowids there.

The problem is being able to tell where the partition_sort_key ends (we need to compare partition_sort_key only). Are they variable-length or they always have the same length?

This part won't have been a problem. Look at filesort()'s sort_order parameter. SORT_FIELD::length is the length of mem-comparable form (i.e. it is a constant). For NULL-able expressions one needs to add a NULL-byte.

We can't use sort keys produced by filesort, but perhaps we could re-use filesort's approach to producing them.. We have the filled SORT_FIELD structure...

Comment by Igor Babaev [ 2016-02-05 ]

Sergey,

create_sort_index() creates a special index. After it has been created you can read records one by one in the order that you passed as a parameter to create_sort_index(). The records are read with a special read function.
When reading records you can catch the end of the last group. To do this you use the values of the group fields in a special buffer.
You can see all this for any test case that uses preliminary sorting for the group by operation.

Comment by Sergei Petrunia [ 2016-02-06 ]

Filed

  • MDEV-9526: Compute Aggregate functions as window functions
  • MDEV-9525: Window functions: Support for aggregate_func(DISTINCT ...) OVER (...)
Comment by Sergei Petrunia [ 2016-02-25 ]

My recollection of the discussion on the last call:

1. High-level and low-level cursors

Igor> We need multiple level of cursors for reading the file in order.

  • Low-level cursor will simply read rowids
  • Higher-level cursor will actually fetch the table row.
    This will be useful to doing other changes, read on.

2. De-couple cursor movement from action

Igor> currently, classes derived from Frame_cursor directly make calls to Item_sum->add|remove. I want cursor movement and action separated.
SergeiP> Don't understand the point of separation. There is only one action, no plans for other actions.

3. Use of row numbers for partition bound detection.

Igor> it is possible to use row number to tell when partition bound ends.
We have

  • frame start
  • current row
  • frame end

onle of these cursors is always ahead of the other two. The idea is:

  • let all cursors remember how many rows they have scanned.
  • Let the cursor that is always ahead remember row_number of the partition bound when it sees it.
  • this allows two other cursors to avoid doing PARTITION BY field comparisons.
Comment by Sergei Petrunia [ 2016-03-02 ]

Take-aways from yesterday discussion

Cursors

It seems like the latest cursors are a step in the right direction. Need to continue

Window frame exclusion

SergeyP> SQL:2008 specifies that one can use "Window frame exclusion" for window functions. Our sql_yacc.yy supports it (see opt_window_frame_exclusion).
Q: How to actually support this?
Igor> No idea.
SergeyP> will need to extend "streaming aggregates" to be "arbitrary insert/removal aggregates".

(Btw, PostgreSQL doesn't support frame exclusion).

Further tasks

SergeiP

  • code cleanups as requested by Igor
  • Improve frame bound cursors
  • Support RANGE-type frame bound cursors

Vicentiu

  • Finish CUME_DIST (as a model two-pass window function)
  • Implement AVG, BIT_OR, BIT_AND, BIT_XOR - aggregates that support streaming.

Igor

  • "Sort" window frames so that we know which window functions can
    share the sorting (the standard actually requires this, for the purpose of
    handling non-determistic cases)
Comment by Sergei Petrunia [ 2016-03-02 ]

Filed MDEV-9676: RANGE-type frames for window functions

Generated at Thu Feb 08 07:09:28 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.