[MDEV-12985] support percentile and median window functions

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Fix Version/s: 10.3.3
Component/s: Optimizer - Window functions
Labels: None
Sprint: 10.3.1-2, 10.3.3-1

Description

The percentile_cont and percentile_disc window functions are available in ColumnStore and many other databases. They allow the calculation of percentiles. percentile_cont interpolates between two adjacent rows when the requested percentile does not fall exactly on a row, while percentile_disc returns an actual value from the set (the first row that reaches the requested percentile). Finally, a median function should exist that is equivalent to percentile_cont(0.5).

These use a slightly different syntax from other window functions to specify the column:

percentile_cont(0.5) within group (order by amount) over (partition by owner) pct_cont,
percentile_disc(0.5) within group (order by amount) over (partition by owner) pct_disc
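
To illustrate what the two expressions above return, here is a minimal Python sketch (not MariaDB code) that emulates OVER (PARTITION BY owner) for the 0.5 case. The sample rows and the helper name median_window are made up for illustration. statistics.median stands in for percentile_cont(0.5) and statistics.median_low for percentile_disc(0.5), assuming the usual definitions (interpolate between the two middle values versus return an actual value from the set).

```python
from collections import defaultdict
from statistics import median, median_low


def median_window(rows):
    """Emulate pct_cont / pct_disc at 0.5 OVER (PARTITION BY owner).

    rows: list of (owner, amount) pairs (hypothetical sample schema).
    Returns one output row per input row:
    (owner, amount, pct_cont, pct_disc).
    """
    # Collect the amounts of each partition.
    partitions = defaultdict(list)
    for owner, amount in rows:
        partitions[owner].append(amount)

    # A window function does not collapse rows: every row of a partition
    # receives the value computed over that whole partition.
    return [
        (owner, amount, median(partitions[owner]), median_low(partitions[owner]))
        for owner, amount in rows
    ]


sample = [("ann", 10), ("ann", 20), ("bob", 5), ("bob", 7), ("bob", 9)]
for out in median_window(sample):
    print(out)
```

For the two-row partition the interpolated result lies between the rows, while the discrete result is one of the stored amounts.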

Some investigation

percentile_cont and percentile_disc are not specifically window functions. They are originally "ordered-set aggregate functions" (#1), which can also be used as window functions (#2):

Ordered-set aggregates

The syntax for case #1:

  percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)

Note the lack of an OVER clause.
Ordered-set aggregate functions are supported by:

PostgreSQL: https://www.postgresql.org/docs/current/static/functions-aggregate.html
Amazon Redshift: http://docs.aws.amazon.com/redshift/latest/dg/c_Aggregate_Functions.html

Neither MariaDB nor MySQL supports any "ordered-set aggregate functions".

Ordered-set aggregates as window functions

Syntax for case #2 (ordered-set aggregate used as a window function):

  PERCENTILE_DISC ( percentile )
  WITHIN GROUP (ORDER BY expr)
  OVER (  [ PARTITION BY expr_list ]  )

(BTW: note that PostgreSQL doesn't support ordered-set aggregates as window functions. From https://www.postgresql.org/docs/current/static/functions-window.html : "any built-in or user-defined normal aggregate function (but not ordered-set or hypothetical-set aggregates) can be used as a window function".)

Issue Links

blocks:
  MCOL-624 MariaDB 10.2 WF create MEDIAN, PERCENTILE_CONT and PERCENTILE_DISC Window functions (Closed)

is duplicated by:
  MDEV-4835 Add Median Function (Closed)

is part of:
  MDEV-12987 complete window function support for columnstore parity (Open)

relates to:
  MDEV-13854 Supporting datetime fields in the order by clause of Percentile functions (Open)
  MDEV-27395 Named windows do not work with MEDIAN() window function. (Open)

Activity

Sergei Petrunia added a comment - 2017-06-09 09:32

dthompson, a question from me and varun: does ColumnStore need just "Ordered-set aggregates as window functions", or does it need both "Ordered-set aggregates as window functions" and "Ordered-set aggregates"? IIRC it supported both.

David Thompson (Inactive) added a comment - 2017-06-09 20:31

ColumnStore 1.0 supported these only as window functions (and I learned a new term!). If you can easily add them as regular aggregates, that's nice to have but not required.

Loop in David.Hall when you have a rough design, as we will still need to reimplement the bottom end of the implementation on the ColumnStore side.

Varun Gupta (Inactive) added a comment - 2017-06-11 12:38

The grammar being:

<inverse distribution function> ::=
    <inverse distribution function type> <left paren>
    <inverse distribution function argument> <right paren>
    <within group specification>

<inverse distribution function argument> ::=
    <numeric value expression>

<inverse distribution function type> ::=
    PERCENTILE_CONT
  | PERCENTILE_DISC

<within group specification> ::=
    WITHIN GROUP <left paren> ORDER BY <sort specification> <right paren>

Varun Gupta (Inactive) added a comment - 2017-06-11 12:40

Specification for the percentile functions

For the <inverse distribution function>:

a) The <within group specification> shall contain a single <sort specification>.
b) The <inverse distribution function> shall not contain a <window function>, a <set function specification>, or a <query expression>.
c) Let DT be the declared type of the <value expression> simply contained in the <sort specification>.
d) DT shall be numeric or interval.
e) The declared type of the result is
   Case:
   i) If DT is numeric, then approximate numeric with implementation-defined precision.
   ii) If DT is interval, then DT.

Varun Gupta (Inactive) added a comment - 2017-06-12 18:19 - edited

More specifications for the inverse distribution function argument

a) Let NVE be the value of the <inverse distribution function argument>

b) If NVE is the null value, then the result is the null value.

c) If NVE is less than 0 (zero) or greater than 1 (one), then an exception condition is raised: data exception — numeric value out of range.
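
The argument rules above can be sketched in Python as follows. This is only an illustration of rules (b) and (c), not the server code: the function name check_fraction is hypothetical, None stands in for SQL NULL, and ValueError stands in for the "numeric value out of range" data exception.

```python
def check_fraction(nve):
    """Validate the percentile argument (NVE in the rules above).

    Returns None for a NULL argument, raises ValueError when the value is
    outside [0, 1], and otherwise returns the value as a float.
    """
    if nve is None:
        # Rule (b): a NULL argument gives a NULL result.
        return None
    if nve < 0 or nve > 1:
        # Rule (c): stand-in for "data exception: numeric value out of range".
        raise ValueError(f"numeric value out of range: {nve}")
    return float(nve)


print(check_fraction(0.5))
print(check_fraction(None))
```

Both boundary values 0 and 1 are accepted, since the rule only rejects values strictly below 0 or strictly above 1.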

Varun Gupta (Inactive) added a comment - 2017-06-14 16:07 - edited

Computation for PERCENTILE_CONT

Get the number of rows in the partition, denoted by N
RN = p*(N-1), where p denotes the argument to the PERCENTILE_CONT function
Calculate FRN (floor row number) and CRN (ceiling row number) for the group: FRN = floor(RN) and CRN = ceil(RN)
Look up rows FRN and CRN
If CRN = FRN = RN, then the result is (value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
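
The steps above translate fairly directly into code. The following Python sketch is an illustration only, not the MariaDB implementation: the function name percentile_cont, the use of a plain list for a partition, and returning None for an empty partition are assumptions. Rows are indexed from zero, which matches RN = p*(N-1).

```python
import math


def percentile_cont(values, p):
    """Linearly interpolated percentile of one partition.

    values: non-NULL numeric values of the partition, in any order.
    p: fraction in [0, 1].
    """
    ordered = sorted(values)      # the WITHIN GROUP (ORDER BY ...) step
    n = len(ordered)
    if n == 0:
        return None               # assumption: empty partition gives NULL
    rn = p * (n - 1)              # zero-based fractional row number
    frn = math.floor(rn)          # floor row number
    crn = math.ceil(rn)           # ceiling row number
    if frn == crn:
        # RN falls exactly on a row, so no interpolation is needed.
        return float(ordered[frn])
    # Weight each neighbour by its distance from RN.
    return (crn - rn) * ordered[frn] + (rn - frn) * ordered[crn]


print(percentile_cont([10, 20], 0.5))
print(percentile_cont([1, 2, 3, 4], 0.25))
```

The special case for CRN = FRN matters: without it both weights would be zero and the formula would return 0 instead of the row's value.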

Varun Gupta (Inactive) added a comment - 2017-06-14 16:09

Computation for PERCENTILE_DISC:

Get the number of rows in the partition
Walk through the partition, in order, until we find the first row with CUME_DIST() >= function_argument
MEDIAN() = PERCENTILE_CONT(0.5)
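
A minimal Python sketch of the PERCENTILE_DISC walk, for illustration only (the function name percentile_disc and the list-based partition are assumptions, not server code). It uses the comparison CUME_DIST() >= p, which is an assumption based on the usual definition of PERCENTILE_DISC as the smallest value whose cumulative distribution reaches p. With a strict > the result for p = 0.5 over two rows would be the second row rather than the first.

```python
from bisect import bisect_right


def percentile_disc(values, p):
    """Return the first value, in sort order, whose CUME_DIST reaches p.

    values: non-NULL values of the partition, in any order.
    p: fraction in [0, 1].
    """
    ordered = sorted(values)      # the WITHIN GROUP (ORDER BY ...) step
    n = len(ordered)
    for value in ordered:
        # CUME_DIST = (number of rows with a value <= this one) / N,
        # so tied rows share the same cumulative distribution.
        cume_dist = bisect_right(ordered, value) / n
        if cume_dist >= p:
            return value
    return None                   # only reached for an empty partition


print(percentile_disc([10, 20], 0.5))
print(percentile_disc([1, 2, 3, 4], 0.75))
```

Unlike the interpolating variant, the result is always one of the input values, so it also works for types that cannot be averaged.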

Varun Gupta (Inactive) added a comment - 2017-09-20 12:21 - edited

Datetime fields are not supported in the first iteration of the percentile functions; I have created a separate issue for this (MDEV-13854). Once MDEV-13854 is done, the percentile functions will support datetime fields.

Vicențiu Ciorbaru added a comment - 2017-10-28 14:41

Minor coding style fixes. Rebase and merge into 10.3 once BB clears it.

Ján Regeš added a comment - 2017-11-30 09:02

@VicentiuCiorbaru - does this mean that the MEDIAN function will be in MariaDB 10.3?

People

Assignee: Varun Gupta (Inactive)

Reporter: David Thompson (Inactive)

Votes: 1

Watchers: 9

Dates

Created: 2017-06-02 20:14

Updated: 2022-01-03 17:26

Resolved: 2017-11-05 05:07

MariaDB Server