[MCOL-521] add distributed regression aggregate and window functions Created: 2017-01-18  Updated: 2018-11-05  Resolved: 2018-11-05

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: None
Fix Version/s: 1.2.0

Type: New Feature Priority: Major
Reporter: David Thompson (Inactive) Assignee: Patrick LeBlanc (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
causes MCOL-1669 Regression in working_tpch1/misc/bug3... Closed
causes MCOL-1793 regr_slope() and regr_r2() produce in... Closed
Relates
relates to MDEV-17467 Add linear regression functions (Full... Open
relates to MCOL-1201 Allow UDAnF to have multiple paramete... Closed
Sprint: 2018-12, 2018-13, 2018-14, 2018-15, 2018-16, 2018-17, 2018-18

 Description   

add support for the regr_* functions as aggregate and window functions.



 Comments   
Comment by David Thompson (Inactive) [ 2017-06-03 ]

This has been scoped out of 1.1 due to the fact that there is currently no support for aggregate functions with multiple parameters. This would need to be added first.

Comment by David Hall (Inactive) [ 2018-07-31 ]

For convenience, I've copied the requirements here:
Regression aggregate and window functions are part of ANSI SQL. However MySQL and MariaDB does not support these functions. For predictive analytics application ability to do regression analysis is key to be able to do forecasting. These functions are:

Function
Input data type
Result data type
Description
regr_avgx(Y, X)
double precision
double precision
average of the independent variable (sum(X)/N)
regr_avgy(Y, X)
double precision
double precision
average of the dependent variable (sum(Y)/N)
regr_count(Y, X)
double precision
bigint
number of input rows in which both expressions are nonnull
regr_slope(Y, X)
double precision
double precision
slope of the least-squares-fit linear equation determined by the (X, Y) pairs
regr_intercept(Y, X)
double precision
double precision
y-intercept of the least-squares-fit linear equation determined by the (X, Y) pairs
regr_r2(Y, X)
double precision
double precision
square of the correlation coefficient. correlation coefficient is the regr_intercept(Y, X) for linear model

When regression line is linear for dependant variable y and independent variable x such that it can be represented by y = a + bx , the regression coefficient is the constant (a) that represents the rate of change of one variable as a function of changes in the other .

Slope(b) = (NΣXY - (ΣX)(ΣY)) / (NΣX2 - (ΣX)2)
Intercept(a) = (ΣY - b(ΣX)) / N

The purpose of this feature is to support the above listed regression functions as aggregate and window functions.

MariaDB ColumnStore shall support following aggregate functions
regr_avgx(Y, X)
regr_avgy(Y, X)
regr_count(Y, X)
regr_slope(Y, X)
regr_intercept(Y, X)
regr_r2(Y, X)
MariaDB ColumnStore shall support following window functions
regr_avgx(Y, X)
regr_avgy(Y, X)
regr_count(Y, X)
regr_slope(Y, X)
regr_intercept(Y, X)
regr_r2(Y, X)

Example Details to be added: TBD

Comment by David Hall (Inactive) [ 2018-10-01 ]

Due to the upgrade to MariaDB 10.3, a manual merge was necessary and branch MCOL-521-b was created. So the merges required are MCOL-521-b #578 for code and MCOL-521 #76 for test.

Comment by Andrew Hutchings (Inactive) [ 2018-10-02 ]

Unfortunately CentOS 6 buildbot failed:

/data/buildbot/bb-worker/centos6_PR/mariadb-columnstore-engine/utils/windowfunction/wf_udaf.cpp: In member function ‘virtual void windowfunction::WF_udaf::operator()(int64_t, int64_t, int64_t)’:
/data/buildbot/bb-worker/centos6_PR/mariadb-columnstore-engine/utils/windowfunction/wf_udaf.cpp:819: error: using ‘typename’ outside of template
/data/buildbot/bb-worker/centos6_PR/mariadb-columnstore-engine/utils/windowfunction/wf_udaf.cpp:853: error: using ‘typename’ outside of template
/data/buildbot/bb-worker/centos6_PR/mariadb-columnstore-engine/utils/windowfunction/wf_udaf.cpp:892: error: using ‘typename’ outside of template
/data/buildbot/bb-worker/centos6_PR/mariadb-columnstore-engine/utils/windowfunction/wf_udaf.cpp:925: error: using ‘typename’ outside of template
/data/buildbot/bb-worker/centos6_PR/mariadb-columnstore-engine/utils/windowfunction/wf_udaf.cpp:958: error: using ‘typename’ outside of template
/data/buildbot/bb-worker/centos6_PR/mariadb-columnstore-engine/utils/windowfunction/wf_udaf.cpp:994: error: using ‘typename’ outside of template

Comment by David Hall (Inactive) [ 2018-10-02 ]

Removed the "typename". Interesting that it compiled on my CentOS 6 with that in there.

Comment by Daniel Lee (Inactive) [ 2018-10-11 ]

Build verified: Forgot the capture git info. The build was made in the morning of Oct 10.

Verified the regression*() functions using the datatypetestm and 1mb DBT3 orders tables and against Oracle 18. Test results were within expectation, with some expected variance.

Also verified the same regr*() functions in windowing function syntax. Each function has 247 queries and all of them execute without any errors. The test was done on a build made today.

commit 6b44f0d9c453ede53024f525b7ddf32b5323171b
Merge: 7db44a7 853a0f7
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Thu Sep 27 20:37:03 2018 +0100

Merge pull request #134 from mariadb-corporation/versionCmakeFix

port changes for mysql_version cmake to fix columnstore RPM packaging

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit 39c283281af045e5b5fb3fe3f399b21a6b1236ca
Merge: 46775f8 19c8a2b
Author: Roman Nozdrin <drrtuy@gmail.com>
Date: Wed Oct 10 20:11:12 2018 +0300

Merge pull request #588 from mariadb-corporation/MCOL-266

MCOL-266 Support true/false DDL default values

/root/columnstore/mariadb-columnstore-tools
commit c2a70128825ba497add8fcf0d0a6d7bbe2af9893
Merge: 1697bdb 3bc9cb0
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Wed Oct 10 08:39:45 2018 +0100

Merge pull request #14 from mariadb-corporation/MCOL-1242

MCOL-1242 - remote cpimport

Comment by Daniel Lee (Inactive) [ 2018-10-11 ]

I checked some of the window function test results for each of the regr*() functions, there seemed to be issues with regr_r2() and regr_slope(). The results between MCS 1.2 and Oracle 18 are more then just significant digits/precisions.
The other four functions seemed to be having similar results.

one slope example.
select o_custkey, regr_slope(o_custkey,o_orderkey) OVER (PARTITION BY abs(o_custkey) ORDER BY o_custkey ,o_orderkey ROWS BETWEEN 15 PRECEDING AND 15 PRECEDING) from (select * from orders where o_custkey <= 20000) s order by 1, 2;

Below are the last 10 rows of results

Oracle:
149.00000000 0.0000
149.00000000 0.0000
149.00000000 0.0000
149.00000000 0.0000
149.00000000 0.0000
149.00000000 0.0000
149.00000000 0.0000
149.00000000 0.0000
149.00000000 0.0000
149.00000000 0.0000
149.00000000 0.0000

columnstore:
149 0.09149093
149 0.09149093
149 0.09651662
149 0.10292585
149 0.11371947
149 0.12166083
149 0.13115944
149 0.13731343
149 0.15562813
149 0.33258929

Comment by Daniel Lee (Inactive) [ 2018-10-11 ]

Closing this ticket. Identified issue is being tracked by MCOL-1793

Comment by David Hall (Inactive) [ 2018-11-05 ]

Missing the stub code for distinct_count()

Comment by David Hall (Inactive) [ 2018-11-05 ]

This is the stub code for distinct_count. Just close this JIRA again after merge.

Comment by David Hall (Inactive) [ 2018-11-05 ]

Missing code has been merged. There's no need for Test to be involved

Generated at Thu Feb 08 02:21:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.