Window functions are evaluated on the result set after the WHERE, GROUP BY and HAVING have been applied.
If we are operating after a temptable-based grouping operation, we can read its output (and apply HAVING on the fly)
if we are operating after a Join operation, or non-temptable based grouping, we will need to store the output of previous operation in a temptable and then work from there.
Basic idea about operation
Let's start with one window function:
func_name(arg) OVER (PARTITION BY part_expr ORDER BY order_expr $window_spec)
Window function is evaluated for each row of the result set. It is a function of row's partition, we need to look at the rows of the resultset that are ordered according to order_expr. $window_spec specifies at which rows we need to look. Sometimes it's certain preceding rows, sometimes it's certain following rows (and sometimes it's both?)
Execution can be done as follows:
sort resultset rows by (part_expr, order_expr).
scan each partition (either in reverse or forward order) and compute the value of window function on the fly.
Unresolved questions:
How to handle the context. If a window function depends on preceding and following rows, how do we keep track of those?
Whether we should use files or temp.tables for storing the sorted resultset (we will always have to store it, because filesort doesn't support pull-based reading of its resultset)
Multiple window functions
A query may use multiple window functions which use different PARTITION BY / ORDER BY expressions:
SELECT
window_func1 OVER (PARTITION BY part_expr1 ORDER BY order_expr1)
window_func2 OVER (PARTITION BY part_expr2 ORDER BY order_expr2)
FROM ... WHERE ... etc
The query should be evaluated as follows:
sort the result according to part_expr1/order_expr1, compute values of window_func1
sort the result according to part_expr2/order_expr2, compute values of window_func2
Currently, target_version= "10.2" for this task means "not 10.1". At the moment, there are no specific plans for 10.2, and nobody has committed to getting this feature done for 10.2.
Thanks for showing your interest (tasks that people are interested in are more likely to be done).
Sergei Petrunia
added a comment - Currently, target_version= "10.2" for this task means "not 10.1". At the moment, there are no specific plans for 10.2, and nobody has committed to getting this feature done for 10.2.
Thanks for showing your interest (tasks that people are interested in are more likely to be done).
Notes from the discussion: it looks like MDEV-5535 could be used in this feature. It would be useful to have multiple cursors against the sorted table.
If I recall correctly, the cursors would point at window frame start, window frame end, and the current row.
Sergei Petrunia
added a comment - Notes from the discussion: it looks like MDEV-5535 could be used in this feature. It would be useful to have multiple cursors against the sorted table.
If I recall correctly, the cursors would point at window frame start, window frame end, and the current row.
MDEV-8646 is about porting the code from MySQL-5.6 that lets the query plan structures to include sorting/aggregation steps. These will be helpful for Windowing functions; Igor is working on that task.
Sergei Petrunia
added a comment - MDEV-8646 is about porting the code from MySQL-5.6 that lets the query plan structures to include sorting/aggregation steps. These will be helpful for Windowing functions; Igor is working on that task.
Here follows a simple approach to implement windowing functions without having to touch much code in sql_select.cc. Almost all code is in Items with some refactoring of init_read_record() so that we can have several filesorts open for the same table.
During parsing, create window function (WF) items and put them also in a
separate list.
During fix fields, regard WF as SUM functions in HAVING (as they will
access columns in the result).
During prepare, force usage of temporary table if WF exists
Temporary tables created will automaticly store values used by WF functions
and place to hold the WF result itself.
Execute normally until a temporary table with the result
(not necessarily ordered) is created.
For each WF item
Execute a filesort in the order specified by WF
Store filesort info in WF item
For each WF item
Reset item (item->clear)
Set row_to_update to first row
Loop over all rows according to it's sort order
If new window (group) changes:
Loop over all rows between row_to_update to current_row
Subtract row from top frame
Store in row_to_update
Move row_to_update to next row
reset item to initial value (item->clear)
Add current row information to item (item->add())
If exists relevant row at top frame of active window, subtract it
(item->sub())
If exists row to be updated, store current value in row and move to
next row. This is not the same as current row as a row may be updated
with values from rows before current row and rows after current row.
Loop over all rows between row_to_update to current_row
Subtract row from top frame
Store in row_to_update
Move row_to_update to next row
The above works for avergage, sum, count etc.
For min/max the item would have to loop over all rows in the frame if
value of the subtracted row is smaller/bigger than the current value.
Benefits of the above solution:
Most work is done in items
Item functions will be realtively simple (clear, add, dec, loop)
No notable changes in optimizer, just a few hooks
Execution cost:
Always require temp table
1 Scan over result set per WF sort order
(It's trivial to combine all WF items with the same order in the above
loop)
Some WF functions require two scans (to calculate number of rows in window),
but that would be trivial to add in above code.
Michael Widenius
added a comment - Here follows a simple approach to implement windowing functions without having to touch much code in sql_select.cc. Almost all code is in Items with some refactoring of init_read_record() so that we can have several filesorts open for the same table.
During parsing, create window function (WF) items and put them also in a
separate list.
During fix fields, regard WF as SUM functions in HAVING (as they will
access columns in the result).
During prepare, force usage of temporary table if WF exists
Temporary tables created will automaticly store values used by WF functions
and place to hold the WF result itself.
Execute normally until a temporary table with the result
(not necessarily ordered) is created.
For each WF item
Execute a filesort in the order specified by WF
Store filesort info in WF item
For each WF item
Reset item (item->clear)
Set row_to_update to first row
Loop over all rows according to it's sort order
If new window (group) changes:
Loop over all rows between row_to_update to current_row
Subtract row from top frame
Store in row_to_update
Move row_to_update to next row
reset item to initial value (item->clear)
Add current row information to item (item->add())
If exists relevant row at top frame of active window, subtract it
(item->sub())
If exists row to be updated, store current value in row and move to
next row. This is not the same as current row as a row may be updated
with values from rows before current row and rows after current row.
Loop over all rows between row_to_update to current_row
Subtract row from top frame
Store in row_to_update
Move row_to_update to next row
The above works for avergage, sum, count etc.
For min/max the item would have to loop over all rows in the frame if
value of the subtracted row is smaller/bigger than the current value.
Benefits of the above solution:
Most work is done in items
Item functions will be realtively simple (clear, add, dec, loop)
No notable changes in optimizer, just a few hooks
Execution cost:
Always require temp table
1 Scan over result set per WF sort order
(It's trivial to combine all WF items with the same order in the above
loop)
Some WF functions require two scans (to calculate number of rows in window),
but that would be trivial to add in above code.
The first thing that doesn't work is: temp.table is created but it doesn't have fields for window function values.
My guess is: create_tmp_table code assumes window functions are a kind of aggregate functions, and so don't need to be in the temp. table.. I'll check this
Sergei Petrunia
added a comment - - edited cvicentiu 's current code is in https://github.com/MariaDB/server/tree/10.2-window_simple . Note that the branch was rebased on 10.2 (so ORIGINAL WINDOW FUNCS commits were modified)
The first thing that doesn't work is: temp.table is created but it doesn't have fields for window function values.
My guess is: create_tmp_table code assumes window functions are a kind of aggregate functions, and so don't need to be in the temp. table.. I'll check this
Ok, window functions do not get into the temp.table because they are skipped here:
#0 create_tmp_field
#1 0x0000555555ad6abd in create_tmp_table
#2 0x0000555555ab2f77 in JOIN::init_execution
#3 0x0000555555ab5318 in JOIN::exec_inner
#4 0x0000555555ab4503 in JOIN::exec
create_tmp_field has this switch:
switch (type) {
case Item::SUM_FUNC_ITEM:
...
default: // Dosen't have to be stored
return 0;
}
and Item::WINDOW_FUNC_ITEM ends up right in the default:.
Sergei Petrunia
added a comment -
Ok, window functions do not get into the temp.table because they are skipped here:
#0 create_tmp_field
#1 0x0000555555ad6abd in create_tmp_table
#2 0x0000555555ab2f77 in JOIN::init_execution
#3 0x0000555555ab5318 in JOIN::exec_inner
#4 0x0000555555ab4503 in JOIN::exec
create_tmp_field has this switch:
switch (type) {
case Item::SUM_FUNC_ITEM:
...
default: // Dosen't have to be stored
return 0;
}
and Item::WINDOW_FUNC_ITEM ends up right in the default: .
Another thing to research:
when I run a SELECT with window functions, create_tmp_table is called with select_options that doesnt include TMP_TABLE_ALL_COLUMNS flag.
Hmm this seems to be ok, if I set that flag in gdb the field is still not created for due to the switch in create_tmp_field.
Sergei Petrunia
added a comment - Another thing to research:
when I run a SELECT with window functions, create_tmp_table is called with select_options that doesnt include TMP_TABLE_ALL_COLUMNS flag.
Hmm this seems to be ok, if I set that flag in gdb the field is still not created for due to the switch in create_tmp_field.
+ case Item::WINDOW_FUNC_ITEM: // psergey-winfunc:
case Item::EXPR_CACHE_ITEM:
if (make_copy_field)
{
However, the temp.table field for "row_number() over (partition by a)" is of
type Field_double.
This is because item_windowfunc.h defines Item_window_func::field_type to
return MYSQL_TYPE_DOUBLE.
Sergei Petrunia
added a comment -
Ok this patch makes the temporary table to create fields for window functions:
diff --git a/sql/sql_select.cc b/sql/sql_select.cc
index 94ed74c..43dc9bf 100644
--- a/sql/sql_select.cc
+++ b/sql/sql_select.cc
@@ -16138,6 +16138,7 @@ Field *create_tmp_field(THD *thd, TABLE *table,Item *item, Item::Type type,
case Item::NULL_ITEM:
case Item::VARBIN_ITEM:
case Item::CACHE_ITEM:
+ case Item::WINDOW_FUNC_ITEM: // psergey-winfunc:
case Item::EXPR_CACHE_ITEM:
if (make_copy_field)
{
However, the temp.table field for "row_number() over (partition by a)" is of
type Field_double.
This is because item_windowfunc.h defines Item_window_func::field_type to
return MYSQL_TYPE_DOUBLE.
> Temporary tables created will automaticly store values used by WF functions and place to hold the WF result itself.
part of the spec is satisfied.
When I try this query:
select row_number() over (partition by a,b order by x) from t1;
the temporary table has fields:
(gdb) p exec_tmp_table1->s->fields
$81 = 4
(gdb) p exec_tmp_table1->field[0].field_name
$82 = 0x7fff5400d555 "x"
(gdb) p exec_tmp_table1->field[1].field_name
$83 = 0x7fff5400d553 "b"
(gdb) p exec_tmp_table1->field[2].field_name
$84 = 0x7fff5400d551 "a"
(gdb) p exec_tmp_table1->field[3].field_name
$85 = 0x7fff54007b08 "row_number() over (partition by a,b order by x)"
OK. However, when I try this:
select row_number() over (partition by a+b order by x) from t1;
the temporary table only has:
(gdb) p exec_tmp_table1->s->fields
$88 = 1
(gdb) p exec_tmp_table1->field[0].field_name
$89 = 0x7fff54007b80 "row_number() over (partition by a+b order by x)"
This is incorrect.
Sergei Petrunia
added a comment - - edited Looking at whether the
> Temporary tables created will automaticly store values used by WF functions and place to hold the WF result itself.
part of the spec is satisfied.
When I try this query:
select row_number() over (partition by a,b order by x) from t1;
the temporary table has fields:
(gdb) p exec_tmp_table1->s->fields
$81 = 4
(gdb) p exec_tmp_table1->field[0].field_name
$82 = 0x7fff5400d555 "x"
(gdb) p exec_tmp_table1->field[1].field_name
$83 = 0x7fff5400d553 "b"
(gdb) p exec_tmp_table1->field[2].field_name
$84 = 0x7fff5400d551 "a"
(gdb) p exec_tmp_table1->field[3].field_name
$85 = 0x7fff54007b08 "row_number() over (partition by a,b order by x)"
OK. However, when I try this:
select row_number() over (partition by a+b order by x) from t1;
the temporary table only has:
(gdb) p exec_tmp_table1->s->fields
$88 = 1
(gdb) p exec_tmp_table1->field[0].field_name
$89 = 0x7fff54007b80 "row_number() over (partition by a+b order by x)"
This is incorrect.
== sort_order construction ==
create_sort_index() uses make_unireg_sortorder() to construct sort ordering
definition for filesort. Window functions feature uses its own code, which does a slightly different thing. Is there any diference?
== PARTITION BY group detection ==
When reading filesort results, what is the best way to tell when one PARTITION BY group ended and the next one began? There are two options:
1. Somehow compute Item expressions and compare them like test_if_group_changed does.
2. filesort result actually consists of { sort_key=
{parititon_sort_key, orderby_sort_key}
, rowid} pairs (right?) We could compare the sort_key parts. The problem is being able to tell where the partition_sort_key ends (we need to compare partition_sort_key only). Are they variable-length or they always have the same length?
Sergei Petrunia
added a comment - - edited Questions that need to be resolved:
== sort_order construction ==
create_sort_index() uses make_unireg_sortorder() to construct sort ordering
definition for filesort. Window functions feature uses its own code, which does a slightly different thing. Is there any diference?
== PARTITION BY group detection ==
When reading filesort results, what is the best way to tell when one PARTITION BY group ended and the next one began? There are two options:
1. Somehow compute Item expressions and compare them like test_if_group_changed does.
2. filesort result actually consists of { sort_key=
{parititon_sort_key, orderby_sort_key}
, rowid} pairs (right?) We could compare the sort_key parts. The problem is being able to tell where the partition_sort_key ends (we need to compare partition_sort_key only). Are they variable-length or they always have the same length?
Which also raises a question: are window functions counted when computing the upper bound for ref_pointer_array_size? (And will this question be relevant after MDEV-8646 is pushed?)
Sergei Petrunia
added a comment - Discovered a bug in the current code: st_select_lex::window_specs is not cleaned up across queries.
If you run queries with window functions repeatedly, you will eventually fail this assert:
DBUG_ASSERT(all_fields.elements <=
thd->lex->current_select->ref_pointer_array_size);
Which also raises a question: are window functions counted when computing the upper bound for ref_pointer_array_size? (And will this question be relevant after MDEV-8646 is pushed?)
Ok the problem with temp.table not having all required fields is gone after I fixed the bug mentioned in the previous comment.
select a, row_number() over (partition by a+b order by x) from t1;
(gdb) p exec_tmp_table1->s->fields
$59 = 4
(gdb) p exec_tmp_table1->field[0]
$60 = (Field_string *) 0x7fff58017118
(gdb) p exec_tmp_table1->field[1]
$61 = (Field_longlong *) 0x7fff580171e0
(gdb) p exec_tmp_table1->field[2]
$62 = (Field_long *) 0x7fff580172a0
(gdb) p exec_tmp_table1->field[3]
$63 = (Field_double *) 0x7fff58017360
(gdb) p exec_tmp_table1->field[0]->field_name
$64 = 0x7fff5800d555 "x"
(gdb) p exec_tmp_table1->field[1]->field_name
$65 = 0x0
(gdb) p exec_tmp_table1->field[2]->field_name
$66 = 0x7fff5800d551 "a"
(gdb) p exec_tmp_table1->field[3]->field_name
$67 = 0x7fff58007c90 "row_number() over (partition by a+b order by x)"
field[1] corresponds to Item_func_plus.
Sergei Petrunia
added a comment - - edited Ok the problem with temp.table not having all required fields is gone after I fixed the bug mentioned in the previous comment.
select a, row_number() over (partition by a+b order by x) from t1;
(gdb) p exec_tmp_table1->s->fields
$59 = 4
(gdb) p exec_tmp_table1->field[0]
$60 = (Field_string *) 0x7fff58017118
(gdb) p exec_tmp_table1->field[1]
$61 = (Field_longlong *) 0x7fff580171e0
(gdb) p exec_tmp_table1->field[2]
$62 = (Field_long *) 0x7fff580172a0
(gdb) p exec_tmp_table1->field[3]
$63 = (Field_double *) 0x7fff58017360
(gdb) p exec_tmp_table1->field[0]->field_name
$64 = 0x7fff5800d555 "x"
(gdb) p exec_tmp_table1->field[1]->field_name
$65 = 0x0
(gdb) p exec_tmp_table1->field[2]->field_name
$66 = 0x7fff5800d551 "a"
(gdb) p exec_tmp_table1->field[3]->field_name
$67 = 0x7fff58007c90 "row_number() over (partition by a+b order by x)"
field [1] corresponds to Item_func_plus.
filesort result actually consists of {sort_key= {parititon_sort_key, orderby_sort_key} , rowid} pairs (right?)
Nope. Tried a small example (too small to require merges), and filesort() called save_index() to produce an array with result data. It seems it wrote only rowids there.
The problem is being able to tell where the partition_sort_key ends (we need to compare partition_sort_key only). Are they variable-length or they always have the same length?
This part won't have been a problem. Look at filesort()'s sort_order parameter. SORT_FIELD::length is the length of mem-comparable form (i.e. it is a constant). For NULL-able expressions one needs to add a NULL-byte.
We can't use sort keys produced by filesort, but perhaps we could re-use filesort's approach to producing them.. We have the filled SORT_FIELD structure...
Sergei Petrunia
added a comment - - edited
filesort result actually consists of {sort_key= {parititon_sort_key, orderby_sort_key} , rowid} pairs (right?)
Nope. Tried a small example (too small to require merges), and filesort() called save_index() to produce an array with result data. It seems it wrote only rowids there.
The problem is being able to tell where the partition_sort_key ends (we need to compare partition_sort_key only). Are they variable-length or they always have the same length?
This part won't have been a problem. Look at filesort()'s sort_order parameter. SORT_FIELD::length is the length of mem-comparable form (i.e. it is a constant). For NULL-able expressions one needs to add a NULL-byte.
We can't use sort keys produced by filesort, but perhaps we could re-use filesort's approach to producing them.. We have the filled SORT_FIELD structure...
create_sort_index() creates a special index. After it has been created you can read records one by one in the order that you passed as a parameter to create_sort_index(). The records are read with a special read function.
When reading records you can catch the end of the last group. To do this you use the values of the group fields in a special buffer.
You can see all this for any test case that uses preliminary sorting for the group by operation.
Igor Babaev (Inactive)
added a comment - Sergey,
create_sort_index() creates a special index. After it has been created you can read records one by one in the order that you passed as a parameter to create_sort_index(). The records are read with a special read function.
When reading records you can catch the end of the last group. To do this you use the values of the group fields in a special buffer.
You can see all this for any test case that uses preliminary sorting for the group by operation.
MDEV-9526: Compute Aggregate functions as window functions
MDEV-9525: Window functions: Support for aggregate_func(DISTINCT ...) OVER (...)
Sergei Petrunia
added a comment - Filed
MDEV-9526 : Compute Aggregate functions as window functions
MDEV-9525 : Window functions: Support for aggregate_func(DISTINCT ...) OVER (...)
My recollection of the discussion on the last call:
1. High-level and low-level cursors
Igor> We need multiple level of cursors for reading the file in order.
Low-level cursor will simply read rowids
Higher-level cursor will actually fetch the table row.
This will be useful to doing other changes, read on.
2. De-couple cursor movement from action
Igor> currently, classes derived from Frame_cursor directly make calls to Item_sum->add|remove. I want cursor movement and action separated.
SergeiP> Don't understand the point of separation. There is only one action, no plans for other actions.
3. Use of row numbers for partition bound detection.
Igor> it is possible to use row number to tell when partition bound ends.
We have
frame start
current row
frame end
onle of these cursors is always ahead of the other two. The idea is:
let all cursors remember how many rows they have scanned.
Let the cursor that is always ahead remember row_number of the partition bound when it sees it.
this allows two other cursors to avoid doing PARTITION BY field comparisons.
Sergei Petrunia
added a comment - My recollection of the discussion on the last call:
1. High-level and low-level cursors
Igor> We need multiple level of cursors for reading the file in order.
Low-level cursor will simply read rowids
Higher-level cursor will actually fetch the table row.
This will be useful to doing other changes, read on.
2. De-couple cursor movement from action
Igor> currently, classes derived from Frame_cursor directly make calls to Item_sum->add|remove. I want cursor movement and action separated.
SergeiP> Don't understand the point of separation. There is only one action, no plans for other actions.
3. Use of row numbers for partition bound detection.
Igor> it is possible to use row number to tell when partition bound ends.
We have
frame start
current row
frame end
onle of these cursors is always ahead of the other two. The idea is:
let all cursors remember how many rows they have scanned.
Let the cursor that is always ahead remember row_number of the partition bound when it sees it.
this allows two other cursors to avoid doing PARTITION BY field comparisons.
It seems like the latest cursors are a step in the right direction. Need to continue
Window frame exclusion
SergeyP> SQL:2008 specifies that one can use "Window frame exclusion" for window functions. Our sql_yacc.yy supports it (see opt_window_frame_exclusion).
Q: How to actually support this?
Igor> No idea.
SergeyP> will need to extend "streaming aggregates" to be "arbitrary insert/removal aggregates".
(Btw, PostgreSQL doesn't support frame exclusion).
Further tasks
SergeiP
code cleanups as requested by Igor
Improve frame bound cursors
Support RANGE-type frame bound cursors
Vicentiu
Finish CUME_DIST (as a model two-pass window function)
Implement AVG, BIT_OR, BIT_AND, BIT_XOR - aggregates that support streaming.
Igor
"Sort" window frames so that we know which window functions can
share the sorting (the standard actually requires this, for the purpose of
handling non-determistic cases)
Sergei Petrunia
added a comment - Take-aways from yesterday discussion
Cursors
It seems like the latest cursors are a step in the right direction. Need to continue
Window frame exclusion
SergeyP> SQL:2008 specifies that one can use "Window frame exclusion" for window functions. Our sql_yacc.yy supports it (see opt_window_frame_exclusion).
Q: How to actually support this?
Igor> No idea.
SergeyP> will need to extend "streaming aggregates" to be "arbitrary insert/removal aggregates".
(Btw, PostgreSQL doesn't support frame exclusion).
Further tasks
SergeiP
code cleanups as requested by Igor
Improve frame bound cursors
Support RANGE-type frame bound cursors
Vicentiu
Finish CUME_DIST (as a model two-pass window function)
Implement AVG, BIT_OR, BIT_AND, BIT_XOR - aggregates that support streaming.
Igor
"Sort" window frames so that we know which window functions can
share the sorting (the standard actually requires this, for the purpose of
handling non-determistic cases)
Hi guys,
Any ideas if this will make it into 10.2.0 and also roughly when is the release date for 10.2.0?
Cheers