[MDEV-7773] Aggregate stored functions Created: 2015-03-13  Updated: 2022-06-15  Resolved: 2017-12-04

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Procedure
Fix Version/s: 10.3.3

Type: Task Priority: Critical
Reporter: Sergei Golubchik Assignee: Varun Gupta (Inactive)
Resolution: Fixed Votes: 3
Labels: 10.3-beta, gsoc15, gsoc16

Issue Links:
Relates
relates to MDEV-14520 Custom aggregate functions work incor... Closed
relates to MDEV-24580 Aggregate stored function return alwa... Open
relates to MCOL-523 support user defined aggregate functions Closed
relates to MDEV-11263 Aggregate Window Functions Stalled
relates to MDEV-11301 SUGGESTION: Increase the number of Ag... Open
relates to MDEV-14564 support FOR loop in stored aggregate ... Stalled
relates to MDEV-28849 I_S tables don't tell that the stored... Open
Sprint: 10.3.1-1, 10.3.1-2, 10.3.3-1

 Description   

With CREATE FUNCTION one can create functions in SQL, but this syntax doesn't allow one to create an aggregate function (like SUM, AVG, etc). This task is to add support for aggregate stored functions.

Syntax

is not decided yet. SQL Standard 2003 doesn't support aggregate stored functions (may be the newer standard does). Oracle, PostgreSQL, HSQL (may be more) all implement aggregate stored functions using their own incompatible syntax extensions. SQL Server and DB2 do not support creating new aggregate stored functions in SQL.

The syntax should at least allow for

  • the code to return the function value
  • the code to be called per row in a group, it accumulates data, does not return a value
  • storage that is preserved between calls within a group

and possibly

  • code to invoke between groups to reset the storage
  • code to remove data from the group (useful for window functions)


 Comments   
Comment by Sergei Golubchik [ 2016-02-19 ]

A possible approach is to have only one function that does everything — accumulates and returns the result. This is what HSQL has:

 CREATE AGGREGATE FUNCTION udavg(IN x INTEGER, IN flag BOOLEAN, INOUT acc BIGINT, INOUT counter INT)
...

here the first argument — x — is the actual argument of the aggregate function (called as SELECT udagv(x) ...), the second — flag — tells where to accumulate or return the result, the last two — acc and counter — are the storage, their value is preserved between calls and automatically reset to NULL when a new group starts.

The function then does something like

  IF flag THEN
    -- return the value
  ELSE
    -- update count and acc with the new value of x
  END IF;

Comment by Sergei Golubchik [ 2016-02-19 ]

PostgreSQL assembles an aggregate function out of two regular functions. The state, like in the HSQL case, is an INOUT argument for these functions. The syntax looks like

CREATE FUNCTION norm1 (... ) ... ;
CREATE FUNCTION norm2 (... ) ... ;
CREATE AGGREGATE (...) ( SFUNC=norm1, FINALFUNC=norm2, ... )

it also allows to specify the initial value for the state and if the function is MIN or MAX (that is, can be answered by taking the first or last value from the index).

Comment by Sergei Golubchik [ 2016-02-19 ]

Oracle uses an object-oriented approach. An aggregate function is an object, the state is stored in the object, initialize/iterate/terminate/merge are the methods.

create type SecondMaxImpl as object
(
  max NUMBER, -- highest value seen so far 
  secmax NUMBER, -- second highest value seen so far
  static function ODCIAggregateInitialize(sctx IN OUT SecondMaxImpl) return number,
  member function ODCIAggregateIterate(self IN OUT SecondMaxImpl, value IN number) return number,
  member function ODCIAggregateTerminate(self IN SecondMaxImpl, returnValue OUT number, flags IN number) return number,
  member function ODCIAggregateMerge(self IN OUT SecondMaxImpl, ctx2 IN SecondMaxImpl) return number
);
create type body SecondMaxImpl is 
static function ODCIAggregateInitialize(sctx IN OUT SecondMaxImpl) return number is 
begin
  sctx := SecondMaxImpl(0, 0);
  return ODCIConst.Success;
end;
-- define other methods accordingly...
CREATE FUNCTION SecondMax (input NUMBER) RETURN NUMBER 
PARALLEL_ENABLE AGGREGATE USING SecondMaxImpl;

Comment by Sergei Golubchik [ 2016-02-19 ]

Another approach is to use a pre-defined cursor that iterates the group. Like this:

CREATE AGGREGATE FUNCTION avg(x DOUBLE) RETURNS DOUBLE
BEGIN
  DECLARE count INT DEFAULT 0;
  DECLARE sum DOUBLE DEFAULT 0;
  DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' RETURN sum/count;
  LOOP
    FETCH GROUP NEXT ROW;
    SET count:=count+1;
    SET sum:=sum+x;
  END LOOP;
END

Note that GROUP keyword is reserved, so this syntax can be unambiguously distinguished from FETCH cursor_name.

Comment by Sergei Golubchik [ 2016-02-19 ]

Another possibility is to use a complex CREATE FUNCTION with many blocks:

CREATE AGGREGATE FUNCTION avg(x DOUBLE) RETURNS DOUBLE
  DECLARE count INT DEFAULT 0;  -- note, state in declarations before the first BEGIN
  DECLARE sum DOUBLE DEFAULT 0;
BEGIN                           -- this is called for every row in a group
  SET count:=count+1;
  SET sum:=sum+x;
END;
BEGIN                           -- this returns the result
  RETURN sum/count;
END;

but this is too confusing. BEGIN/END blocks need keywords or labels to specify what they do.

Comment by varun gupta [ 2016-02-29 ]

Can you give more details on the argument x, what i understand that it is one of the attributes of the table, and we need to have an aggregate function for the attribute x

Comment by Sergei Golubchik [ 2016-02-29 ]

x is the function argument. Like in

SELECT avg(val) FROM table1 GROUP BY grn;

Comment by varun gupta [ 2016-02-29 ]

Yeah understand it is the argument to the function.
When we write
SELECT avg(val) FROM table1 GROUP BY grn;
val is one of the columns of our table(attribute). Am I correct?

Comment by Oleksandr Byelkin [ 2016-03-30 ]

val should be an arbitrary expression.

Comment by Sergei Golubchik [ 2016-04-08 ]

Window function support

To work efficiently over windows, an aggregate function should be able to remove a row from a group. This can be implemented in all syntax variants discussed above. In all cases row-removal functionality is optional, it is an optimization, but a aggregate function can work over windows without it too.

Adding support for row removal is not part of this MDEV-7773, but it should be possible to add that functionality later, MDEV-7773 should be compatible with this future extension.

HSQL

The second argument IN becomes an ENUM(ADD,REMOVE,RESULT)

PostgreSQL

The definition supports a new clause REMOVEFUNC=

Oracle

The object gets a new method remove

Cursor approach

A new syntax FETCH GROUP REMOVED ROW. A value is generated when both cursors ("next" and "removed") have returned SQLSTATE 02000. Note: This is not enough, in addition return needs not to return, but that would be too confusing, wouldn't it?

Many blocks approach

One more block, that is invoked to remove a row

Comment by Sergei Golubchik [ 2016-04-09 ]

The complete example of the AVGOW (average over window) function using the cursor approach:

CREATE AGGREGATE FUNCTION avgow(x DOUBLE) RETURNS DOUBLE
BEGIN
  DECLARE count INT DEFAULT 0;
  DECLARE sum DOUBLE DEFAULT 0;
  DECLARE loop_done INT;
  DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET loop_done=1;
  LOOP
    SET loop_done=0;
    WHILE loop_done = 0
      FETCH GROUP REMOVED ROW;
      SET count:=count-1;
      SET sum:=sum-x;
    END WHILE;
    SET loop_done=0;
    WHILE loop_done = 0
      FETCH GROUP NEXT ROW;
      SET count:=count+1;
      SET sum:=sum+x;
    END WHILE;
    RETURN IF(count,sum/count,NULL);
  END LOOP;
END

It is more complex than necessary, but shows all the features. Note, that

  • there are two cursors, not one (the presence of a second cursor makes this function “window-aware” and enables two-cursor optimization).
  • after return execution continues! The function is restarted for every window partition, but it is not restarted for every frame inside a window.
  • The function exhausts the REMOVED cursor first, then the NEXT cursor. This order is required. There is no logical reason for that, it is how our window functions work at the moment. This limitation can (and should) be removed, but let's do it as the next step.
  • This function is universal, it works both for GROUP BY and for OVER WINDOW. In the former case the “frame” is equal to the whole group, so the function is restarted after every return (in other words, it has the usual semantics).
Comment by Sergei Golubchik [ 2016-04-29 ]

Another problem of the cursor approach — if we ever will support non-SQL stored procedures (like Antony Curtis work), this will unlikely to be able to use cursors. At least, not generally for all languages.

Comment by Oleksandr Byelkin [ 2016-04-29 ]

There should be ways to read cursors from any stored procedures (special functions or something like this) in any case we get control at that moment and do what we need. It is as difficult as for our procedures IMHO.

Comment by Alexander Barkov [ 2016-05-01 ]

The proposed syntax looks fine for simple cases.

I'd propose to think towards another direction though:
implement aggregate functions as SQL classes!

Internally all aggregate functions are implemented as classes in the Item_sum hierarchy and they have various methods and members. To be able to support all features that a built-in aggregate function supports (like WINDOW functionality, DISTINCT, etc) we could implement stored functions as classes as well, using the SQL standard definitions for user-defined types.

See the "<user-defined type definition>" in the SQL standard.

A skeleton for a class definition would look like this:

-- Define an aggregate function handler
-- Derive it from some top-level built-in abstract class MariaDBAggregateFunctionHandler
CREATE TYPE handler_avgow UNDER MariaDBAggregateFunctionHandler
AS
(
 count INT DEFAULT 0,  # member 1
 sum DOUBLE DEFAULT 0  # member 2
)
OVERRIDING METHOD val_real() RETURNS DOUBLE
OVERRIDING METHOD fix_length_and_dec() RETURNS BOOLEAN
OVERRIDING METHOD reset_and_add() RETURNS BOOLEAN
...
;

After a class is defined, we can create methods using the standard syntax:
(in SQL method cannot be defined "inline", so there will be separate statements for methods:

-- Define aggregate handler methods
CREATE METHOD val_real() FOR handler_avgov
 BEGIN
   RETURN IF(count,sum/count,NULL);
 END;
 CREATE METHOD fix_length_and_dec() ...;
 ...

Now, when we have a class with its methods created with the standard syntax, we can create an aggregate function using non-standard syntax, for example like this:

-- Define the function itself
CREATE AGGREGATE FUNCTION avgow(x DOUBLE) RETURNS DOUBLE
USING handler_avgrow;

For non-SQL definitions we could use some syntax (perhaps it already exists in the SQL standard, I saw something like EXTERNAL NAME '...', which looks very related) to load implementations from external shared libraries. This syntax could include <language clause>.
How SQL and non-SQL definitions would relate:

  • The CREATE TYPE statement will be the same for an SQL and non-SQL implementation.
  • The CREATE METHOD statement will be different for SQL and non-SQL!!! (explicit SQL routine definition vs external module reference)
  • The CREATE AGGREGATE FUNCTION statement will be the same for SQL and non-SQL

Advantages of the class-based conception:

  • A great step towards the SQL standard. This CREATE TYPE useful per se, to create user-defined data types.
  • Splits the code logically. You don't have to care about the order of different SQL code pieces, like cursors in the above example. This code will reside in different methods, so the server will make sure to execute them in proper order. Also, you don't have to care about signals. All together this should give better performance, and better user-defined SQL code readability.
  • SQL and Non-SQL have very close interface (only CREATE METHOD differs)
  • Classes can be used to add SQL code for other user-defined purposes (not only for aggregate functions), e.g.:
    • Hybrid type functions, which need not only val_xxx() (RETURN in terms of SQL), but also fix_length_and_dec() to decide their return type depending on the parameters passed.
    • PROCEDURE in SQL (rewrite the SELECT data set into a different data set)
    • SQL code wrappers around pluggable modules for various server components (for example parser or optimizer), engines, and all other plugin types. A user defined parser could derive from the built-in parser to call the derived parse method then additionally store some statistics somewhere using SQL. An SQL wrapper for an engine could override some engine methods (using SQL) to disable certain components (e.g. there is a request to disable everything in ConnectSE, except ODBC). So there could be two instances of ConnectSE: the standard full-featured one with full functionality available for authorized users, and a derived ODBC-only one for non-authorized users.
Comment by Sergei Golubchik [ 2016-05-06 ]

This is very similar to Oracle approach (see above).

But this is way too verbose and complex in my opinion. It could be simplified
as:

CREATE AGGREGATE FUNCTION avgow
AS
(
 count INT DEFAULT 0,  -- member 1
 sum DOUBLE DEFAULT 0  -- member 2
)
OVERRIDING METHOD val_real() RETURNS DOUBLE
OVERRIDING METHOD fix_length_and_dec() RETURNS BOOLEAN
OVERRIDING METHOD reset_and_add() RETURNS BOOLEAN
...
;
CREATE METHOD val_real() FOR avgow
 BEGIN
   RETURN IF(count,sum/count,NULL);
 END;
CREATE METHOD fix_length_and_dec() ...;
 ...

or even as

CREATE AGGREGATE FUNCTION avgow
AS
(
 count INT DEFAULT 0,  -- member 1
 sum DOUBLE DEFAULT 0  -- member 2
);
CREATE METHOD val_real() FOR avgow
 BEGIN
   RETURN IF(count,sum/count,NULL);
 END;
CREATE METHOD fix_length_and_dec() ...;
 ...

But in certain cases it is more limited than cursor-based approach, it doesn't
allow one to implement a median or a percentile function, because you cannot
have temporary tables or arrays in the AS ( ... ) clause. Still, both
approaches can be combined:

CREATE AGGREGATE FUNCTION avgow(x DOUBLE) RETURNS DOUBLE
BEGIN
  -- ... aggreagate function body, as above
END;
CREATE METHOD fix_length_and_dec() ... FOR avgow...
 ...

And the beauty of this is that it perfectly applies to non-aggregate functions
too:

CREATE FUNCTION foo(x DOUBLE) RETURNS DECIMAL(3,1)
BEGIN
  -- ...
END;
CREATE METHOD fix_length_and_dec() ... FOR foo...
 ...

Comment by Sergei Golubchik [ 2016-05-06 ]

Hm, may be omitting the OVERRIDING clause is not a good idea — it might allow someone from a concurrent connection to invoke the function before it's fully defined. And that might be dangerous, if, for example, one of the overriding methods does some kind of the access checks.

But OVERRIDING can be made optional. In the above case one will specify OVERRIDING, but often one will be able to omit it.

Or, may be, it'll be simpler to stick to the standard here and require the OVERRIDING clause?

Comment by Alexander Barkov [ 2016-05-06 ]

Good idea to build CREATE TYPE inside CREATE [AGGREGATE] FUNCTION.
We should probably try to define method bodies inline (without having separate CREATE METHOD statements).
This would solve the problem with concurrent connections.

Comment by Vicențiu Ciorbaru [ 2017-07-10 ]

The current test case seems to corrupt the procs table:

delimiter |;
 
create aggregate function agg_sum(x INT) returns double
begin
declare z double default 0;
declare continue handler for not found return z;
 
loop
fetch group next row;
set z = z + x;
end loop;
end|
 
delimiter ;|
 
create table t1 (id int, salary int);
 
INSERT INTO t1 VALUES (1, 100), (2, 40), (3, 6);
 
show create function agg_sum;
alter function agg_sum aggregate none;
show create function agg_sum; 

Final query fails with:

# This fails with 'show create function agg_sum' failed: 1457: Failed to load routine test.agg_sum. The table mysql.proc is missing, corrupt, or contains bad data (internal code -6)

Comment by Vicențiu Ciorbaru [ 2017-10-28 ]

Just a few minor dead-code removal and history cleanup, then we can push.

Comment by Varun Gupta (Inactive) [ 2017-11-26 ]

A final rebase was required, resolved the conflicts and have sent it to cvicentiu for review.
Have pushed it to bb-10.3-varun for final testing

Comment by Elena Stepanova [ 2017-11-29 ]

I haven't got anything of significance in a round of tests on bb-10.3-varun 7d8c5ee4039.
Testing of correctness of results is limited due to numerous existing problems in this area, unrelated to the new functionality. I haven't got any functionality-specific crashes.
After fixing MDEV-14520 and addressing all reviews, please go ahead and push to 10.3. If reviews require significant changes in the code, please notify so that I re-run the tests.

Please make sure the functionality is documented in the KB, if not done yet.

Among other things, I think it should be specifically mentioned in the documentation that unlike built-in functions, the custom ones don't provide automatic type recognition and act according to the types specified in the function itself. For a simple example, consider the following

create or replace table t1 (i int, c char(8), ci char(8));
insert into t1 values (10,'foo','10'),(2,'bar','2');
 
delimiter $
create or replace aggregate function agg_max1(x INT) returns INT
begin
  declare m INT default NULL;
  declare continue handler for not found return m;
  loop
   fetch group next row;
   if (x is not null) and (m is null or m < x) then
      set m= x;
   end if;
  end loop;
end $
 
create or replace aggregate function agg_max2(x BLOB) returns BLOB
begin
  declare m BLOB default NULL;
  declare continue handler for not found return m;
  loop
   fetch group next row;
   if (x is not null) and (m is null or m < x) then
      set m= x;
   end if;
  end loop;
end $
delimiter ;

Here, column `i` is integer, column `ci` is text which contains integer values, and column `c` is text.
The in-built function MAX will treat each of them accordingly – the result will be based on integer comparison for `i`, on text comparison for `ci` and `c`.

MariaDB [test]> select max(i), max(ci), max(c) from t1;
+--------+---------+--------+
| max(i) | max(ci) | max(c) |
+--------+---------+--------+
|     10 | 2       | foo    |
+--------+---------+--------+
1 row in set (0.00 sec)

Definitions of functions agg_max1 and agg_max2 are identical, except that agg_max1 works with integers and agg_max2 with blobs.

On the same data, agg_max2 will do text comparison for all columns, regardless their initial type

MariaDB [test]> select agg_max2(i), agg_max2(ci), agg_max2(c) from t1;
+-------------+--------------+-------------+
| agg_max2(i) | agg_max2(ci) | agg_max2(c) |
+-------------+--------------+-------------+
| 2           | 2            | foo         |
+-------------+--------------+-------------+
1 row in set (0.00 sec)

Function agg_max1 will do integer comparison for anything that looks like integer – in this case, for `i` and for `ci`:

MariaDB [test]> select agg_max1(i), agg_max1(ci) from t1;
+-------------+--------------+
| agg_max1(i) | agg_max1(ci) |
+-------------+--------------+
|          10 |           10 |
+-------------+--------------+
1 row in set (0.00 sec)

And for something that doesn't look like integer, it doesn't work at all:

MariaDB [test]> set sql_mode= 'STRICT_ALL_TABLES';
Query OK, 0 rows affected (0.02 sec)
 
MariaDB [test]> select agg_max1(c) from t1;
ERROR 1366 (22007): Incorrect integer value: 'foo' for column 'x' at row 1
MariaDB [test]> set sql_mode= '';
Query OK, 0 rows affected (0.00 sec)
 
MariaDB [test]> select agg_max1(c) from t1;
+-------------+
| agg_max1(c) |
+-------------+
|           0 |
+-------------+
1 row in set, 3 warnings (0.00 sec)

Given that functions must define explicit types, I don't see a bug here, but it might be something users don't expect, so better to document it.

Generated at Thu Feb 08 07:22:09 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.