The percentile_cont and percentile_disc window functions are available in columnstore and many other databases. These allow calculation of percentiles. Percentile_cont will average 2 rows if one is not identified while Percentile_disc picks the first row in the window. Finally a median function should exist which is equivalent to percentile_cont(0.5).
These have slightly different syntax than other window function to specify the column:
percentile_cont(0.5) within group (order by amount) over (partition by owner) pct_cont,
percentile_disc(0.5) within group (order by amount) over (partition by owner) pct_disc
Some investigation
percentile_cont and percentile_disc are not specifically window functions. They originally are "ordered-set aggregate functions" (#1) which one can also use as window functions (#2):
Ordered-set aggreates
The syntax for case #1:
percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)
Note the lack of OVER clause.
Ordered-set aggregate functions are supported by:
[dthompson, a question from me and varun: Does ColumnStore need just "Ordered-set aggregates as window functions", or it needs both "Ordered-set aggregates as window functions" and "Ordered-set aggregates". IIRC it supported both?
Sergei Petrunia
added a comment - [ dthompson , a question from me and varun : Does ColumnStore need just "Ordered-set aggregates as window functions", or it needs both "Ordered-set aggregates as window functions" and "Ordered-set aggregates". IIRC it supported both?
ColumnStore 1.0 supported these only as window functions (and i learned a new term!). If you can easily add as a regular aggregate that's nice to have but not required.
Loop in David.Hall when you have a rough design as we will still need to reimplement the bottom end of the implementation on the columnstore side.
David Thompson (Inactive)
added a comment - ColumnStore 1.0 supported these only as window functions (and i learned a new term!). If you can easily add as a regular aggregate that's nice to have but not required.
Loop in David.Hall when you have a rough design as we will still need to reimplement the bottom end of the implementation on the columnstore side.
|<inverse distribution function type> <left paren>
<inverse distribution function argument> <right paren><within group specification>
<inverse distribution function argument> ::=
<numeric value expression>
<inverse distribution function type> ::=
PERCENTILE_CONT
| PERCENTILE_DISC
<within group specification> ::=
WITHIN GROUP <left paren> ORDER BY <sort specification> <right paren>
Varun Gupta (Inactive)
added a comment -
The grammar being:
<inverse distribution function> ::=
|<inverse distribution function type> <left paren>
<inverse distribution function argument> <right paren><within group specification>
<inverse distribution function argument> ::=
<numeric value expression>
<inverse distribution function type> ::=
PERCENTILE_CONT
| PERCENTILE_DISC
<within group specification> ::=
WITHIN GROUP <left paren> ORDER BY <sort specification> <right paren>
a) The <within group specification> shall contain a single <sort specification>
b) The <inverse distribution function> shall not contain a <window function>, a
<set function specification>, or a <query expression>
c) Let DT be the declared type of the <value expression> simply contained in the
<sort specification>.
d) DT shall be numeric or interval.
e) The declared type of the result is
Case:
i) If DT is numeric, then approximate numeric with implementation-defined precision.
ii) If DT is interval, then DT.
Varun Gupta (Inactive)
added a comment - Specification for the percentile functions
For the <inverse distribution function>
a) The <within group specification> shall contain a single <sort specification>
b) The <inverse distribution function> shall not contain a <window function>, a
<set function specification>, or a <query expression>
c) Let DT be the declared type of the <value expression> simply contained in the
<sort specification>.
d) DT shall be numeric or interval.
e) The declared type of the result is
Case:
i) If DT is numeric, then approximate numeric with implementation-defined precision.
ii) If DT is interval, then DT.
More specifications for the inverse distribution function argument
a) Let NVE be the value of the <inverse distribution function argument>
b) If NVE is the null value, then the result is the null value.
c) If NVE is less than 0 (zero) or greater than 1 (one), then an exception condition is raised: data exception — numeric value out of range.
Varun Gupta (Inactive)
added a comment - - edited More specifications for the inverse distribution function argument
a) Let NVE be the value of the <inverse distribution function argument>
b) If NVE is the null value, then the result is the null value.
c) If NVE is less than 0 (zero) or greater than 1 (one), then an exception condition is raised: data exception — numeric value out of range.
Get the number of rows in the partition, denoted by N
RN = p*(N-1), where p denotes the argument to the PERCENTILE_CONT function
calculate the FRN(floor row number) and CRN(column row number for the group( FRN= floor(RN) and CRN = ceil(RN))
look up rows FRN and CRN
If (CRN = FRN = RN) then the result is (value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
Varun Gupta (Inactive)
added a comment - - edited Computation for PERCENTILE_CONT
Get the number of rows in the partition, denoted by N
RN = p*(N-1), where p denotes the argument to the PERCENTILE_CONT function
calculate the FRN(floor row number) and CRN(column row number for the group( FRN= floor(RN) and CRN = ceil(RN))
look up rows FRN and CRN
If (CRN = FRN = RN) then the result is (value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)
walk through the partition, in order, until we find the the first row with CUME_DIST() > function_argument
MEDIAN() = PERCENTILE_DISC(0.5
Varun Gupta (Inactive)
added a comment - Computation for PERCENTILE_DISC:
Get the number of rows in the partition
walk through the partition, in order, until we find the the first row with CUME_DIST() > function_argument
MEDIAN() = PERCENTILE_DISC(0.5
Datetime fields are not supported in the first iteration of percentile functions, have created a seperate issue for it.(MDEV-13854). After MDEV-13854 , we would have datetime fields support in percentile functions
Varun Gupta (Inactive)
added a comment - - edited Datetime fields are not supported in the first iteration of percentile functions, have created a seperate issue for it.( MDEV-13854 ). After MDEV-13854 , we would have datetime fields support in percentile functions
[dthompson, a question from me and varun: Does ColumnStore need just "Ordered-set aggregates as window functions", or it needs both "Ordered-set aggregates as window functions" and "Ordered-set aggregates". IIRC it supported both?