Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-12985

support percentile and median window functions

Details

    • 10.3.1-2, 10.3.3-1

    Description

      The percentile_cont and percentile_disc window functions are available in columnstore and many other databases. These allow calculation of percentiles. Percentile_cont will average 2 rows if one is not identified while Percentile_disc picks the first row in the window. Finally a median function should exist which is equivalent to percentile_cont(0.5).

      These have slightly different syntax than other window function to specify the column:

      percentile_cont(0.5) within group (order by amount) over (partition by owner) pct_cont,
      percentile_disc(0.5) within group (order by amount) over (partition by owner) pct_disc

      Some investigation

      percentile_cont and percentile_disc are not specifically window functions. They originally are "ordered-set aggregate functions" (#1) which one can also use as window functions (#2):

      Ordered-set aggreates

      The syntax for case #1:

        percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)
      

      Note the lack of OVER clause.
      Ordered-set aggregate functions are supported by:

      Ordered-set aggregates as window functions

      Syntax for case #2 (ordered-set aggregate, used as window function)

      PERCENTILE_DISC ( percentile )
      WITHIN GROUP (ORDER BY expr)
      OVER (  [ PARTITION BY expr_list ]  )
      

      (BTW: note that PostgreSQL doesn't support ordered-set-aggregates-as-window functions: https://www.postgresql.org/docs/current/static/functions-window.html ,

      any built-in or user-defined normal aggregate function (but not ordered-set or hypothetical-set aggregates) can be used as a window function)

      Attachments

        Issue Links

          Activity

            dthompson David Thompson (Inactive) created issue -
            dthompson David Thompson (Inactive) made changes -
            Field Original Value New Value
            dthompson David Thompson (Inactive) made changes -
            varun Varun Gupta (Inactive) made changes -
            Assignee Varun Gupta [ varun ]
            varun Varun Gupta (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            psergei Sergei Petrunia made changes -
            Description The percentile_cont and percentile_disc window functions are available in columnstore and many other databases. These allow calculation of percentiles. Percentile_cont will average 2 rows if one is not identified while Percentile_disc picks the first row in the window. Finally a median function should exist which is equivalent to percentile_cont(0.5).

            These have slightly different syntax than other window function to specify the column:

            percentile_cont(0.5) within group (order by amount) over (partition by owner) pct_cont,
            percentile_disc(0.5) within group (order by amount) over (partition by owner) pct_disc
            The percentile_cont and percentile_disc window functions are available in columnstore and many other databases. These allow calculation of percentiles. Percentile_cont will average 2 rows if one is not identified while Percentile_disc picks the first row in the window. Finally a median function should exist which is equivalent to percentile_cont(0.5).

            These have slightly different syntax than other window function to specify the column:

            percentile_cont(0.5) within group (order by amount) over (partition by owner) pct_cont,
            percentile_disc(0.5) within group (order by amount) over (partition by owner) pct_disc

            h2. Some investigation
            percentile_cont and percentile_disc are not specifically window functions. They originally are "ordered-set aggregate functions" (#1) which one can also use as window functions (#2):

            h3. Ordered-set aggreates
            The syntax for case #1:
            {noformat}
              percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)
            {noformat}
            Note the lack of OVER clause.
            Ordered-set aggregate functions are supported by:
            * https://www.postgresql.org/docs/current/static/functions-aggregate.html
            * http://docs.aws.amazon.com/redshift/latest/dg/c_Aggregate_Functions.html
            Neither MariaDB nor MySQL support any "ordered-set aggregate functions".

            h3. Ordered-set aggregates as window functions
            Syntax for case #2 (ordered-set aggregate, used as window function)

            * http://docs.aws.amazon.com/redshift/latest/dg/r_WF_PERCENTILE_DISC.html
            * https://docs.oracle.com/cd/B12037_01/server.101/b10759/functions100.htm#i1000909

            {noformat}
            PERCENTILE_DISC ( percentile )
            WITHIN GROUP (ORDER BY expr)
            OVER ( [ PARTITION BY expr_list ] )
            {noformat}

            (BTW: note that PostgreSQL doesn't support ordered-set-aggregates-as-window functions: https://www.postgresql.org/docs/current/static/functions-window.html ,
            {quote}any built-in or user-defined normal aggregate function (but not ordered-set or hypothetical-set aggregates) can be used as a window function)
            {quote}

            [dthompson, a question from me and varun: Does ColumnStore need just "Ordered-set aggregates as window functions", or it needs both "Ordered-set aggregates as window functions" and "Ordered-set aggregates". IIRC it supported both?

            psergei Sergei Petrunia added a comment - [ dthompson , a question from me and varun : Does ColumnStore need just "Ordered-set aggregates as window functions", or it needs both "Ordered-set aggregates as window functions" and "Ordered-set aggregates". IIRC it supported both?

            ColumnStore 1.0 supported these only as window functions (and i learned a new term!). If you can easily add as a regular aggregate that's nice to have but not required.

            Loop in David.Hall when you have a rough design as we will still need to reimplement the bottom end of the implementation on the columnstore side.

            dthompson David Thompson (Inactive) added a comment - ColumnStore 1.0 supported these only as window functions (and i learned a new term!). If you can easily add as a regular aggregate that's nice to have but not required. Loop in David.Hall when you have a rough design as we will still need to reimplement the bottom end of the implementation on the columnstore side.

            The grammar being:
            	
            <inverse distribution function> ::=
            |<inverse distribution function type> <left paren> 
            <inverse distribution function argument> <right paren><within group specification>					
             
            <inverse distribution function argument> ::=
              <numeric value expression>					
             
            <inverse distribution function type> ::=
               PERCENTILE_CONT					
             | PERCENTILE_DISC
            					
            <within group specification> ::=
              WITHIN GROUP <left paren> ORDER BY <sort specification> <right paren>
             
            

            varun Varun Gupta (Inactive) added a comment - The grammar being: <inverse distribution function> ::= |<inverse distribution function type> <left paren> <inverse distribution function argument> <right paren><within group specification>   <inverse distribution function argument> ::= <numeric value expression> <inverse distribution function type> ::= PERCENTILE_CONT | PERCENTILE_DISC <within group specification> ::= WITHIN GROUP <left paren> ORDER BY <sort specification> <right paren>

            Specification for the percentile functions

             
            For the <inverse distribution function>
            a)  The <within group specification> shall contain a single <sort specification>
            b)  The <inverse distribution function> shall not contain a <window function>, a 
                 <set  function specification>, or a <query expression>
                  c) Let DT be the declared type of the <value expression> simply contained in the 
                      <sort specification>.
            d)  DT shall be numeric or interval.
            e)  The declared type of the result is
                 Case:
                  i)  If DT is numeric, then approximate numeric with implementation-defined precision.
                 ii)  If DT is interval, then DT. 
            
            

            varun Varun Gupta (Inactive) added a comment - Specification for the percentile functions   For the <inverse distribution function> a) The <within group specification> shall contain a single <sort specification> b) The <inverse distribution function> shall not contain a <window function>, a <set function specification>, or a <query expression> c) Let DT be the declared type of the <value expression> simply contained in the <sort specification>. d) DT shall be numeric or interval. e) The declared type of the result is Case: i) If DT is numeric, then approximate numeric with implementation-defined precision. ii) If DT is interval, then DT.
            varun Varun Gupta (Inactive) added a comment - - edited

            More specifications for the inverse distribution function argument

            a) Let NVE be the value of the <inverse distribution function argument> 
            b) If NVE is the null value, then the result is the null value. 
            c) If NVE is less than 0 (zero) or greater than 1 (one), then an exception condition is raised: data exception — numeric value out of range.
            

            varun Varun Gupta (Inactive) added a comment - - edited More specifications for the inverse distribution function argument a) Let NVE be the value of the <inverse distribution function argument> b) If NVE is the null value, then the result is the null value. c) If NVE is less than 0 (zero) or greater than 1 (one), then an exception condition is raised: data exception — numeric value out of range.
            varun Varun Gupta (Inactive) added a comment - - edited

            Computation for PERCENTILE_CONT

            1. Get the number of rows in the partition, denoted by N
            2. RN = p*(N-1), where p denotes the argument to the PERCENTILE_CONT function
            3. calculate the FRN(floor row number) and CRN(column row number for the group( FRN= floor(RN) and CRN = ceil(RN))
            4. look up rows FRN and CRN
            5. If (CRN = FRN = RN) then the result is (value of expression from row at RN)
            6. Otherwise the result is
              (CRN - RN) * (value of expression for row at FRN) +
              (RN - FRN) * (value of expression for row at CRN)
            varun Varun Gupta (Inactive) added a comment - - edited Computation for PERCENTILE_CONT Get the number of rows in the partition, denoted by N RN = p*(N-1), where p denotes the argument to the PERCENTILE_CONT function calculate the FRN(floor row number) and CRN(column row number for the group( FRN= floor(RN) and CRN = ceil(RN)) look up rows FRN and CRN If (CRN = FRN = RN) then the result is (value of expression from row at RN) Otherwise the result is (CRN - RN) * (value of expression for row at FRN) + (RN - FRN) * (value of expression for row at CRN)

            Computation for PERCENTILE_DISC:

            1. Get the number of rows in the partition
            2. walk through the partition, in order, until we find the the first row with CUME_DIST() > function_argument
            3. MEDIAN() = PERCENTILE_DISC(0.5
            varun Varun Gupta (Inactive) added a comment - Computation for PERCENTILE_DISC: Get the number of rows in the partition walk through the partition, in order, until we find the the first row with CUME_DIST() > function_argument MEDIAN() = PERCENTILE_DISC(0.5
            varun Varun Gupta (Inactive) made changes -
            Fix Version/s 10.3 [ 22126 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Epic Link PT-60 [ 62159 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            ratzpo Rasmus Johansson (Inactive) made changes -
            Sprint 10.3.1-2 [ 174 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Rank Ranked higher
            varun Varun Gupta (Inactive) made changes -
            Assignee Varun Gupta [ varun ] Vicentiu Ciorbaru [ cvicentiu ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            cvicentiu Vicențiu Ciorbaru made changes -
            Status In Review [ 10002 ] Stalled [ 10000 ]
            cvicentiu Vicențiu Ciorbaru made changes -
            Assignee Vicentiu Ciorbaru [ cvicentiu ] Varun Gupta [ varun ]
            varun Varun Gupta (Inactive) made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            varun Varun Gupta (Inactive) made changes -
            varun Varun Gupta (Inactive) added a comment - - edited

            Datetime fields are not supported in the first iteration of percentile functions, have created a seperate issue for it.(MDEV-13854). After MDEV-13854 , we would have datetime fields support in percentile functions

            varun Varun Gupta (Inactive) added a comment - - edited Datetime fields are not supported in the first iteration of percentile functions, have created a seperate issue for it.( MDEV-13854 ). After MDEV-13854 , we would have datetime fields support in percentile functions
            serg Sergei Golubchik made changes -
            Sprint 10.3.1-2 [ 174 ] 10.3.1-2, 10.3.3-1 [ 174, 200 ]
            varun Varun Gupta (Inactive) made changes -
            Assignee Varun Gupta [ varun ] Vicentiu Ciorbaru [ cvicentiu ]
            Status In Progress [ 3 ] In Review [ 10002 ]

            Minor coding style fixes. Rebase and merge into 10.3 once BB clears it.

            cvicentiu Vicențiu Ciorbaru added a comment - Minor coding style fixes. Rebase and merge into 10.3 once BB clears it.
            cvicentiu Vicențiu Ciorbaru made changes -
            Assignee Vicentiu Ciorbaru [ cvicentiu ] Varun Gupta [ varun ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            varun Varun Gupta (Inactive) made changes -
            Fix Version/s 10.3.3 [ 22644 ]
            Fix Version/s 10.3 [ 22126 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]

            @VicentiuCiorbaru - does it mean, that MEDIAN function will be in MariaDB 10.3?

            jan.reges Ján Regeš added a comment - @VicentiuCiorbaru - does it mean, that MEDIAN function will be in MariaDB 10.3?
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 81072 ] MariaDB v4 [ 133274 ]
            alice Alice Sherepa made changes -

            People

              varun Varun Gupta (Inactive)
              dthompson David Thompson (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.