Details
-
Task
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
10.3.1-1, 10.3.1-2, 10.3.3-1
Description
With CREATE FUNCTION one can create functions in SQL, but this syntax doesn't allow one to create an aggregate function (like SUM, AVG, etc). This task is to add support for aggregate stored functions.
Syntax
is not decided yet. SQL Standard 2003 doesn't support aggregate stored functions (may be the newer standard does). Oracle, PostgreSQL, HSQL (may be more) all implement aggregate stored functions using their own incompatible syntax extensions. SQL Server and DB2 do not support creating new aggregate stored functions in SQL.
The syntax should at least allow for
- the code to return the function value
- the code to be called per row in a group, it accumulates data, does not return a value
- storage that is preserved between calls within a group
and possibly
- code to invoke between groups to reset the storage
- code to remove data from the group (useful for window functions)
Attachments
Issue Links
- relates to
-
MDEV-14520 Custom aggregate functions work incorrectly with WITH ROLLUP clause
-
- Closed
-
-
MDEV-24580 Aggregate stored function return always NULL in case of empty input set
-
- Open
-
-
MCOL-523 support user defined aggregate functions
-
- Closed
-
-
MDEV-11263 Aggregate Window Functions
-
- Stalled
-
-
MDEV-11301 SUGGESTION: Increase the number of Aggregate Functions
-
- Open
-
-
MDEV-14564 support FOR loop in stored aggregate functions
-
- Stalled
-
-
MDEV-28849 I_S tables don't tell that the stored function is AGGREGATE
-
- Open
-
Activity
PostgreSQL assembles an aggregate function out of two regular functions. The state, like in the HSQL case, is an INOUT argument for these functions. The syntax looks like
CREATE FUNCTION norm1 (... ) ... ; |
CREATE FUNCTION norm2 (... ) ... ; |
CREATE AGGREGATE (...) ( SFUNC=norm1, FINALFUNC=norm2, ... ) |
it also allows to specify the initial value for the state and if the function is MIN or MAX (that is, can be answered by taking the first or last value from the index).
Oracle uses an object-oriented approach. An aggregate function is an object, the state is stored in the object, initialize/iterate/terminate/merge are the methods.
create type SecondMaxImpl as object |
(
|
max NUMBER, -- highest value seen so far |
secmax NUMBER, -- second highest value seen so far |
static function ODCIAggregateInitialize(sctx IN OUT SecondMaxImpl) return number, |
member function ODCIAggregateIterate(self IN OUT SecondMaxImpl, value IN number) return number, |
member function ODCIAggregateTerminate(self IN SecondMaxImpl, returnValue OUT number, flags IN number) return number, |
member function ODCIAggregateMerge(self IN OUT SecondMaxImpl, ctx2 IN SecondMaxImpl) return number |
);
|
create type body SecondMaxImpl is |
static function ODCIAggregateInitialize(sctx IN OUT SecondMaxImpl) return number is |
begin
|
sctx := SecondMaxImpl(0, 0);
|
return ODCIConst.Success; |
end; |
-- define other methods accordingly...
|
CREATE FUNCTION SecondMax (input NUMBER) RETURN NUMBER |
PARALLEL_ENABLE AGGREGATE USING SecondMaxImpl;
|
Another approach is to use a pre-defined cursor that iterates the group. Like this:
CREATE AGGREGATE FUNCTION avg(x DOUBLE) RETURNS DOUBLE |
BEGIN
|
DECLARE count INT DEFAULT 0; |
DECLARE sum DOUBLE DEFAULT 0; |
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' RETURN sum/count; |
LOOP
|
FETCH GROUP NEXT ROW; |
SET count:=count+1; |
SET sum:=sum+x; |
END LOOP; |
END |
Note that GROUP keyword is reserved, so this syntax can be unambiguously distinguished from FETCH cursor_name.
Another possibility is to use a complex CREATE FUNCTION with many blocks:
CREATE AGGREGATE FUNCTION avg(x DOUBLE) RETURNS DOUBLE |
DECLARE count INT DEFAULT 0; -- note, state in declarations before the first BEGIN |
DECLARE sum DOUBLE DEFAULT 0; |
BEGIN -- this is called for every row in a group |
SET count:=count+1; |
SET sum:=sum+x; |
END; |
BEGIN -- this returns the result |
RETURN sum/count; |
END; |
but this is too confusing. BEGIN/END blocks need keywords or labels to specify what they do.
Can you give more details on the argument x, what i understand that it is one of the attributes of the table, and we need to have an aggregate function for the attribute x
x is the function argument. Like in
SELECT avg(val) FROM table1 GROUP BY grn; |
Yeah understand it is the argument to the function.
When we write
SELECT avg(val) FROM table1 GROUP BY grn;
val is one of the columns of our table(attribute). Am I correct?
Window function support
To work efficiently over windows, an aggregate function should be able to remove a row from a group. This can be implemented in all syntax variants discussed above. In all cases row-removal functionality is optional, it is an optimization, but a aggregate function can work over windows without it too.
Adding support for row removal is not part of this MDEV-7773, but it should be possible to add that functionality later, MDEV-7773 should be compatible with this future extension.
HSQL
The second argument IN becomes an ENUM(ADD,REMOVE,RESULT)
PostgreSQL
The definition supports a new clause REMOVEFUNC=
Oracle
The object gets a new method remove
Cursor approach
A new syntax FETCH GROUP REMOVED ROW. A value is generated when both cursors ("next" and "removed") have returned SQLSTATE 02000. Note: This is not enough, in addition return needs not to return, but that would be too confusing, wouldn't it?
Many blocks approach
One more block, that is invoked to remove a row
The complete example of the AVGOW (average over window) function using the cursor approach:
CREATE AGGREGATE FUNCTION avgow(x DOUBLE) RETURNS DOUBLE |
BEGIN
|
DECLARE count INT DEFAULT 0; |
DECLARE sum DOUBLE DEFAULT 0; |
DECLARE loop_done INT; |
DECLARE CONTINUE HANDLER FOR SQLSTATE '02000' SET loop_done=1; |
LOOP
|
SET loop_done=0; |
WHILE loop_done = 0
|
FETCH GROUP REMOVED ROW; |
SET count:=count-1; |
SET sum:=sum-x; |
END WHILE; |
SET loop_done=0; |
WHILE loop_done = 0
|
FETCH GROUP NEXT ROW; |
SET count:=count+1; |
SET sum:=sum+x; |
END WHILE; |
RETURN IF(count,sum/count,NULL); |
END LOOP; |
END |
It is more complex than necessary, but shows all the features. Note, that
- there are two cursors, not one (the presence of a second cursor makes this function “window-aware” and enables two-cursor optimization).
- after return execution continues! The function is restarted for every window partition, but it is not restarted for every frame inside a window.
- The function exhausts the REMOVED cursor first, then the NEXT cursor. This order is required. There is no logical reason for that, it is how our window functions work at the moment. This limitation can (and should) be removed, but let's do it as the next step.
- This function is universal, it works both for GROUP BY and for OVER WINDOW. In the former case the “frame” is equal to the whole group, so the function is restarted after every return (in other words, it has the usual semantics).
Another problem of the cursor approach — if we ever will support non-SQL stored procedures (like Antony Curtis work), this will unlikely to be able to use cursors. At least, not generally for all languages.
There should be ways to read cursors from any stored procedures (special functions or something like this) in any case we get control at that moment and do what we need. It is as difficult as for our procedures IMHO.
The proposed syntax looks fine for simple cases.
I'd propose to think towards another direction though:
implement aggregate functions as SQL classes!
Internally all aggregate functions are implemented as classes in the Item_sum hierarchy and they have various methods and members. To be able to support all features that a built-in aggregate function supports (like WINDOW functionality, DISTINCT, etc) we could implement stored functions as classes as well, using the SQL standard definitions for user-defined types.
See the "<user-defined type definition>" in the SQL standard.
A skeleton for a class definition would look like this:
-- Define an aggregate function handler
|
-- Derive it from some top-level built-in abstract class MariaDBAggregateFunctionHandler
|
CREATE TYPE handler_avgow UNDER MariaDBAggregateFunctionHandler |
AS
|
(
|
count INT DEFAULT 0, # member 1 |
sum DOUBLE DEFAULT 0 # member 2 |
)
|
OVERRIDING METHOD val_real() RETURNS DOUBLE |
OVERRIDING METHOD fix_length_and_dec() RETURNS BOOLEAN |
OVERRIDING METHOD reset_and_add() RETURNS BOOLEAN |
...
|
;
|
After a class is defined, we can create methods using the standard syntax:
(in SQL method cannot be defined "inline", so there will be separate statements for methods:
-- Define aggregate handler methods
|
CREATE METHOD val_real() FOR handler_avgov |
BEGIN |
RETURN IF(count,sum/count,NULL); |
END; |
CREATE METHOD fix_length_and_dec() ...; |
...
|
Now, when we have a class with its methods created with the standard syntax, we can create an aggregate function using non-standard syntax, for example like this:
-- Define the function itself
|
CREATE AGGREGATE FUNCTION avgow(x DOUBLE) RETURNS DOUBLE |
USING handler_avgrow;
|
For non-SQL definitions we could use some syntax (perhaps it already exists in the SQL standard, I saw something like EXTERNAL NAME '...', which looks very related) to load implementations from external shared libraries. This syntax could include <language clause>.
How SQL and non-SQL definitions would relate:
- The CREATE TYPE statement will be the same for an SQL and non-SQL implementation.
- The CREATE METHOD statement will be different for SQL and non-SQL!!! (explicit SQL routine definition vs external module reference)
- The CREATE AGGREGATE FUNCTION statement will be the same for SQL and non-SQL
Advantages of the class-based conception:
- A great step towards the SQL standard. This CREATE TYPE useful per se, to create user-defined data types.
- Splits the code logically. You don't have to care about the order of different SQL code pieces, like cursors in the above example. This code will reside in different methods, so the server will make sure to execute them in proper order. Also, you don't have to care about signals. All together this should give better performance, and better user-defined SQL code readability.
- SQL and Non-SQL have very close interface (only CREATE METHOD differs)
- Classes can be used to add SQL code for other user-defined purposes (not only for aggregate functions), e.g.:
- Hybrid type functions, which need not only val_xxx() (RETURN in terms of SQL), but also fix_length_and_dec() to decide their return type depending on the parameters passed.
- PROCEDURE in SQL (rewrite the SELECT data set into a different data set)
- SQL code wrappers around pluggable modules for various server components (for example parser or optimizer), engines, and all other plugin types. A user defined parser could derive from the built-in parser to call the derived parse method then additionally store some statistics somewhere using SQL. An SQL wrapper for an engine could override some engine methods (using SQL) to disable certain components (e.g. there is a request to disable everything in ConnectSE, except ODBC). So there could be two instances of ConnectSE: the standard full-featured one with full functionality available for authorized users, and a derived ODBC-only one for non-authorized users.
This is very similar to Oracle approach (see above).
But this is way too verbose and complex in my opinion. It could be simplified
as:
CREATE AGGREGATE FUNCTION avgow |
AS
|
(
|
count INT DEFAULT 0, -- member 1 |
sum DOUBLE DEFAULT 0 -- member 2 |
)
|
OVERRIDING METHOD val_real() RETURNS DOUBLE |
OVERRIDING METHOD fix_length_and_dec() RETURNS BOOLEAN |
OVERRIDING METHOD reset_and_add() RETURNS BOOLEAN |
...
|
;
|
CREATE METHOD val_real() FOR avgow |
BEGIN |
RETURN IF(count,sum/count,NULL); |
END; |
CREATE METHOD fix_length_and_dec() ...; |
...
|
or even as
CREATE AGGREGATE FUNCTION avgow |
AS
|
(
|
count INT DEFAULT 0, -- member 1 |
sum DOUBLE DEFAULT 0 -- member 2 |
);
|
CREATE METHOD val_real() FOR avgow |
BEGIN |
RETURN IF(count,sum/count,NULL); |
END; |
CREATE METHOD fix_length_and_dec() ...; |
...
|
But in certain cases it is more limited than cursor-based approach, it doesn't
allow one to implement a median or a percentile function, because you cannot
have temporary tables or arrays in the AS ( ... ) clause. Still, both
approaches can be combined:
CREATE AGGREGATE FUNCTION avgow(x DOUBLE) RETURNS DOUBLE |
BEGIN
|
-- ... aggreagate function body, as above |
END; |
CREATE METHOD fix_length_and_dec() ... FOR avgow... |
...
|
And the beauty of this is that it perfectly applies to non-aggregate functions
too:
CREATE FUNCTION foo(x DOUBLE) RETURNS DECIMAL(3,1) |
BEGIN
|
-- ... |
END; |
CREATE METHOD fix_length_and_dec() ... FOR foo... |
...
|
Hm, may be omitting the OVERRIDING clause is not a good idea — it might allow someone from a concurrent connection to invoke the function before it's fully defined. And that might be dangerous, if, for example, one of the overriding methods does some kind of the access checks.
But OVERRIDING can be made optional. In the above case one will specify OVERRIDING, but often one will be able to omit it.
Or, may be, it'll be simpler to stick to the standard here and require the OVERRIDING clause?
Good idea to build CREATE TYPE inside CREATE [AGGREGATE] FUNCTION.
We should probably try to define method bodies inline (without having separate CREATE METHOD statements).
This would solve the problem with concurrent connections.
The current test case seems to corrupt the procs table:
delimiter |;
|
|
create aggregate function agg_sum(x INT) returns double |
begin
|
declare z double default 0; |
declare continue handler for not found return z; |
|
loop
|
fetch group next row; |
set z = z + x; |
end loop; |
end| |
|
delimiter ;|
|
|
create table t1 (id int, salary int); |
|
INSERT INTO t1 VALUES (1, 100), (2, 40), (3, 6); |
|
show create function agg_sum; |
alter function agg_sum aggregate none; |
show create function agg_sum; |
Final query fails with:
# This fails with 'show create function agg_sum' failed: 1457: Failed to load routine test.agg_sum. The table mysql.proc is missing, corrupt, or contains bad data (internal code -6)
|
Just a few minor dead-code removal and history cleanup, then we can push.
A final rebase was required, resolved the conflicts and have sent it to cvicentiu for review.
Have pushed it to bb-10.3-varun for final testing
I haven't got anything of significance in a round of tests on bb-10.3-varun 7d8c5ee4039.
Testing of correctness of results is limited due to numerous existing problems in this area, unrelated to the new functionality. I haven't got any functionality-specific crashes.
After fixing MDEV-14520 and addressing all reviews, please go ahead and push to 10.3. If reviews require significant changes in the code, please notify so that I re-run the tests.
Please make sure the functionality is documented in the KB, if not done yet.
Among other things, I think it should be specifically mentioned in the documentation that unlike built-in functions, the custom ones don't provide automatic type recognition and act according to the types specified in the function itself. For a simple example, consider the following
create or replace table t1 (i int, c char(8), ci char(8)); |
insert into t1 values (10,'foo','10'),(2,'bar','2'); |
|
delimiter $
|
create or replace aggregate function agg_max1(x INT) returns INT |
begin
|
declare m INT default NULL; |
declare continue handler for not found return m; |
loop
|
fetch group next row; |
if (x is not null) and (m is null or m < x) then |
set m= x; |
end if; |
end loop; |
end $ |
|
create or replace aggregate function agg_max2(x BLOB) returns BLOB |
begin
|
declare m BLOB default NULL; |
declare continue handler for not found return m; |
loop
|
fetch group next row; |
if (x is not null) and (m is null or m < x) then |
set m= x; |
end if; |
end loop; |
end $ |
delimiter ;
|
Here, column `i` is integer, column `ci` is text which contains integer values, and column `c` is text.
The in-built function MAX will treat each of them accordingly – the result will be based on integer comparison for `i`, on text comparison for `ci` and `c`.
MariaDB [test]> select max(i), max(ci), max(c) from t1; |
+--------+---------+--------+ |
| max(i) | max(ci) | max(c) | |
+--------+---------+--------+ |
| 10 | 2 | foo |
|
+--------+---------+--------+ |
1 row in set (0.00 sec) |
Definitions of functions agg_max1 and agg_max2 are identical, except that agg_max1 works with integers and agg_max2 with blobs.
On the same data, agg_max2 will do text comparison for all columns, regardless their initial type
MariaDB [test]> select agg_max2(i), agg_max2(ci), agg_max2(c) from t1; |
+-------------+--------------+-------------+ |
| agg_max2(i) | agg_max2(ci) | agg_max2(c) |
|
+-------------+--------------+-------------+ |
| 2 | 2 | foo |
|
+-------------+--------------+-------------+ |
1 row in set (0.00 sec) |
Function agg_max1 will do integer comparison for anything that looks like integer – in this case, for `i` and for `ci`:
MariaDB [test]> select agg_max1(i), agg_max1(ci) from t1; |
+-------------+--------------+ |
| agg_max1(i) | agg_max1(ci) |
|
+-------------+--------------+ |
| 10 | 10 |
|
+-------------+--------------+ |
1 row in set (0.00 sec) |
And for something that doesn't look like integer, it doesn't work at all:
MariaDB [test]> set sql_mode= 'STRICT_ALL_TABLES'; |
Query OK, 0 rows affected (0.02 sec) |
|
MariaDB [test]> select agg_max1(c) from t1; |
ERROR 1366 (22007): Incorrect integer value: 'foo' for column 'x' at row 1 |
MariaDB [test]> set sql_mode= ''; |
Query OK, 0 rows affected (0.00 sec) |
|
MariaDB [test]> select agg_max1(c) from t1; |
+-------------+ |
| agg_max1(c) |
|
+-------------+ |
| 0 |
|
+-------------+ |
1 row in set, 3 warnings (0.00 sec) |
Given that functions must define explicit types, I don't see a bug here, but it might be something users don't expect, so better to document it.
A possible approach is to have only one function that does everything — accumulates and returns the result. This is what HSQL has:
...
here the first argument — x — is the actual argument of the aggregate function (called as SELECT udagv(x) ...), the second — flag — tells where to accumulate or return the result, the last two — acc and counter — are the storage, their value is preserved between calls and automatically reset to NULL when a new group starts.
The function then does something like