[MCOL-1985] Fix up regr functions in regrmysql.cpp so regr_* funtions work correcly for InnoDB tables Created: 2018-11-29  Updated: 2019-10-28  Resolved: 2019-07-05

Status: Closed
Project: MariaDB ColumnStore
Component/s: MDB Plugin
Affects Version/s: 1.2.1
Fix Version/s: 1.2.5

Type: Bug Priority: Minor
Reporter: David Hall (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Sprint: 2019-04, 2019-05, 2019-06

 Description   

regrmysql.cpp is the UDAF plugin for the regr_*** functions. It hasn't been kept up to date.

The idea is that we also add the regr_*** functions to all engines via the Server API.

However, the algorithms haven't been kept up with the changes in the Columnstore implementations. In addition, they return a fixed decimal of 2 decimal places, which isn't correct.



 Comments   
Comment by David Hall (Inactive) [ 2019-04-18 ]

QA: The following queries should produce the same answer in InnoDB as Columnstore. In some cases, the number of decimals will be different, but it should never happen that InnoDB has fewer decimals than Columnstore in the regr_*** column.

In these queries, I created an inno_**** set of the database. If you already have a reference database set up that happens to have the Columnstore libs, then you can use that (change the table names in the queries). These queries are not magic. What's important is that InnoDB and Columnstore match in the results of the regr_*** functions. Any query that returns a non-zero result can be used for comparison, but it is important to check that the number of digits to the right of the decimal for innodb is at least that of Columnstore. A true test would use Int, double and decimal types. Also testing as a Window function might be appropriate. Be aware of MDEV-19270 – ORDER BY in the OVER() clause plus an ORDER BY on an indexed field on the query causes a debug assert. I don't know if this breaks in a release version.

WARNING: These queries can run a long time on InnoDB, and really long if there's no indexes. A smaller data sample might be better.

If you have a MariaDB reference without the Columnstore libs, you can copy libregr_mysql.so.1.0.0 from columnstore/mysql/lib to your mariadb installation. and create a symlink to libregr_mysql.so. CREATE the functions in a mariadb client:

CREATE AGGREGATE FUNCTION corr returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION covar_pop returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION covar_samp returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION regr_avgx returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION regr_avgy returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION regr_count returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION regr_intercept returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION regr_r2 returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION regr_slope returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION regr_sxx returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION regr_sxy returns REAL soname 'libregr_mysql.so';
CREATE AGGREGATE FUNCTION regr_syy returns REAL soname 'libregr_mysql.so';

SELECT empno, depname, salary, bonus, regr_avgx(empno, bonus) OVER (ORDER BY empno), MAX(bonus) OVER (ORDER BY empno) from (select *, (2013 - extract(YEAR FROM enroll_date)) * 500 as bonus from inno_empsalary3 where (empno >10000) and (empno < 10100) ) s order by depname, salary, bonus;

select avg(distinct l_extendedprice), l_tax, regr_avgy(l_tax, l_extendedprice) from inno_lineitem group by l_tax order by l_tax;

select count(l_extendedprice), l_tax, regr_count(l_tax, l_extendedprice) from inno_lineitem group by l_tax order by l_tax;

select avg(l_extendedprice), l_tax, round(regr_intercept(l_tax, l_extendedprice), 8) intercept from inno_lineitem group by l_tax order by l_tax;

select avg(l_extendedprice), l_tax, round(regr_r2(l_tax, l_extendedprice), 12) from inno_lineitem group by l_tax order by l_tax;

select avg(l_extendedprice), l_tax, regr_sxx(l_partkey, l_suppkey) from inno_lineitem group by l_tax order by l_tax;

select n_name, l_commitdate, round(regr_sxy(l_extendedprice, s_acctbal), 3) regr_sxy_bal,
sum(l_extendedprice) sum_price, min(s_suppkey) minskey, count
from inno_nation, inno_supplier, inno_lineitem
where s_nationkey in (1,2)
and l_commitdate between '1998-01-01' and '1998-01-07'
and n_nationkey = s_nationkey
and s_suppkey = l_suppkey
group by n_name, l_commitdate
order by 1, 2;

select p_brand, sum(l_quantity) tot_qty,
round(regr_syy(l_tax, l_quantity),3) syy_qty, count
from inno_part, inno_lineitem
where l_shipdate between '1996-04-01' and '1996-04-14'
and p_size = 5
and p_partkey = l_partkey
group by p_brand
order by 1;

select p_brand, sum(l_quantity) tot_qty, round(corr(l_tax, l_quantity),3) corr_qty, count from inno_part, inno_lineitem where l_shipdate between '1996-04-01' and '1996-04-14' and p_size = 5 and p_partkey = l_partkey group by p_brand order by 1;

select p_brand, sum(l_quantity) tot_qty, round(covar_pop(l_tax, l_quantity),3) covar_pop_qty, count from inno_part, inno_lineitem where l_shipdate between '1996-04-01' and '1996-04-14' and p_size = 5 and p_partkey = l_partkey group by p_brand order by 1;

select p_brand, sum(l_quantity) tot_qty, round(covar_samp(l_tax, l_quantity),3) covar_samp_qty, count from inno_part, inno_lineitem where l_shipdate between '1996-04-01' and '1996-04-14' and p_size = 5 and p_partkey = l_partkey group by p_brand order by 1;

Comment by David Hall (Inactive) [ 2019-04-23 ]

Added new PR#741
Also regression test PR#106

Comment by Daniel Lee (Inactive) [ 2019-05-14 ]

Testing has been completed on nightly build. Please review test results and checkout the issues that have been identified.

Comment by David Hall (Inactive) [ 2019-05-21 ]

The functions are returning the correct answers in both InnoDB and Columnstore.

There's a significant discrepancy between Columnstore and an Oracle reference run. The discrepancy appears to be caused by a difference in how the final sort is done. Columnstore (and MariaDB) treat the output of the regr_*** window functions as REAL and sort numerically. Oracle treats these as strings and sorts as such:

select o_custkey, regr_sxx(o_custkey,o_orderkey) OVER (PARTITION BY 55 ORDER BY o_custkey ,o_orderkey ROWS 15 PRECEDING) from orders order by 1, 2;

From columnstore:
1 0
1 1922
1 25234.666666666668
1 246914.75
1 1538759.2
2 1583767.3333333333
2 1688367.7142857143
2 1826573.5
2 2480560.222222222
2 3148438.1
2 3791879.6363636362
2 17027279.666666668
2 33985333.23076923

From oracle:
1 0.00000000
1 1538759.20000000
1 1922.00000000
1 246914.75000000
1 25234.66670000
2 1583767.33000000
2 1688367.71000000
2 17027279.70000000
2 1826573.50000000
2 2480560.22000000
2 3148438.10000000
2 33985333.20000000
2 3791879.64000000

So naturally, a diff of these results shows significant differences.

Comment by Daniel Lee (Inactive) [ 2019-07-05 ]

Build verified: 1.2.5-1 nightly

server commit:
f44f7d9
engine commit:
4e477ab

Generated at Thu Feb 08 02:32:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.