[MDEV-12985] support percentile and median window functions Created: 2017-06-02  Updated: 2022-01-03  Resolved: 2017-11-05

Status: Closed
Project: MariaDB Server
Component/s: Optimizer - Window functions
Fix Version/s: 10.3.3

Type: Task Priority: Major
Reporter: David Thompson (Inactive) Assignee: Varun Gupta (Inactive)
Resolution: Fixed Votes: 1
Labels: None

Issue Links:
Blocks
blocks MCOL-624 MariaDB 10.2 WF create MEDIAN, PERCEN... Closed
Duplicate
is duplicated by MDEV-4835 Add Median Function Closed
PartOf
is part of MDEV-12987 complete window function support for ... Open
Relates
relates to MDEV-13854 Supporting datetime fields in the ord... Open
relates to MDEV-27395 Named windows do not work with MEDIAN... Open
Sprint: 10.3.1-2, 10.3.3-1

 Description   

The percentile_cont and percentile_disc window functions are available in columnstore and many other databases. These allow calculation of percentiles. Percentile_cont will average 2 rows if one is not identified while Percentile_disc picks the first row in the window. Finally a median function should exist which is equivalent to percentile_cont(0.5).

These have slightly different syntax than other window function to specify the column:

percentile_cont(0.5) within group (order by amount) over (partition by owner) pct_cont,
percentile_disc(0.5) within group (order by amount) over (partition by owner) pct_disc

Some investigation

percentile_cont and percentile_disc are not specifically window functions. They originally are "ordered-set aggregate functions" (#1) which one can also use as window functions (#2):

Ordered-set aggreates

The syntax for case #1:

  percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)

Note the lack of OVER clause.
Ordered-set aggregate functions are supported by:

Ordered-set aggregates as window functions

Syntax for case #2 (ordered-set aggregate, used as window function)

PERCENTILE_DISC ( percentile )
WITHIN GROUP (ORDER BY expr)
OVER (  [ PARTITION BY expr_list ]  )

(BTW: note that PostgreSQL doesn't support ordered-set-aggregates-as-window functions: https://www.postgresql.org/docs/current/static/functions-window.html ,

any built-in or user-defined normal aggregate function (but not ordered-set or hypothetical-set aggregates) can be used as a window function)



 Comments   
Comment by Sergei Petrunia [ 2017-06-09 ]

[dthompson, a question from me and varun: Does ColumnStore need just "Ordered-set aggregates as window functions", or it needs both "Ordered-set aggregates as window functions" and "Ordered-set aggregates". IIRC it supported both?

Comment by David Thompson (Inactive) [ 2017-06-09 ]

ColumnStore 1.0 supported these only as window functions (and i learned a new term!). If you can easily add as a regular aggregate that's nice to have but not required.

Loop in David.Hall when you have a rough design as we will still need to reimplement the bottom end of the implementation on the columnstore side.

Comment by Varun Gupta (Inactive) [ 2017-06-11 ]

The grammar being:
	
<inverse distribution function> ::=
|<inverse distribution function type> <left paren> 
<inverse distribution function argument> <right paren><within group specification>					
 
<inverse distribution function argument> ::=
  <numeric value expression>					
 
<inverse distribution function type> ::=
   PERCENTILE_CONT					
 | PERCENTILE_DISC
					
<within group specification> ::=
  WITHIN GROUP <left paren> ORDER BY <sort specification> <right paren>
 

Comment by Varun Gupta (Inactive) [ 2017-06-11 ]

Specification for the percentile functions

 
For the <inverse distribution function>
a)  The <within group specification> shall contain a single <sort specification>
b)  The <inverse distribution function> shall not contain a <window function>, a 
     <set  function specification>, or a <query expression>
      c) Let DT be the declared type of the <value expression> simply contained in the 
          <sort specification>.
d)  DT shall be numeric or interval.
e)  The declared type of the result is
     Case:
      i)  If DT is numeric, then approximate numeric with implementation-defined precision.
     ii)  If DT is interval, then DT. 

Comment by Varun Gupta (Inactive) [ 2017-06-12 ]

More specifications for the inverse distribution function argument

a) Let NVE be the value of the <inverse distribution function argument> 
b) If NVE is the null value, then the result is the null value. 
c) If NVE is less than 0 (zero) or greater than 1 (one), then an exception condition is raised: data exception — numeric value out of range.

Comment by Varun Gupta (Inactive) [ 2017-06-14 ]

Computation for PERCENTILE_CONT

  1. Get the number of rows in the partition, denoted by N
  2. RN = p*(N-1), where p denotes the argument to the PERCENTILE_CONT function
  3. calculate the FRN(floor row number) and CRN(column row number for the group( FRN= floor(RN) and CRN = ceil(RN))
  4. look up rows FRN and CRN
  5. If (CRN = FRN = RN) then the result is (value of expression from row at RN)
  6. Otherwise the result is
    (CRN - RN) * (value of expression for row at FRN) +
    (RN - FRN) * (value of expression for row at CRN)
Comment by Varun Gupta (Inactive) [ 2017-06-14 ]

Computation for PERCENTILE_DISC:

  1. Get the number of rows in the partition
  2. walk through the partition, in order, until we find the the first row with CUME_DIST() > function_argument
  3. MEDIAN() = PERCENTILE_DISC(0.5
Comment by Varun Gupta (Inactive) [ 2017-09-20 ]

Datetime fields are not supported in the first iteration of percentile functions, have created a seperate issue for it.(MDEV-13854). After MDEV-13854 , we would have datetime fields support in percentile functions

Comment by Vicențiu Ciorbaru [ 2017-10-28 ]

Minor coding style fixes. Rebase and merge into 10.3 once BB clears it.

Comment by Ján Regeš [ 2017-11-30 ]

@VicentiuCiorbaru - does it mean, that MEDIAN function will be in MariaDB 10.3?

Generated at Thu Feb 08 08:02:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.