[MCOL-5104] Statistic Functions improvement: STDDEV_SAMP, STDDEV_POP, VAR_POP, VAP_SAMP Created: 2022-05-27  Updated: 2022-06-27  Resolved: 2022-06-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: None
Fix Version/s: 6.2.3

Type: New Feature Priority: Major
Reporter: Leonid Fedorov Assignee: Andrey Piskunov (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
blocks MCOL-5031 ARM tests Closed
Sprint: 2021-17

 Description   

Columnstore statistic functions STDDEV_SAMP, STDDEV_POP, VAR_POP, VAP_SAMP
use naive algorithm, that looses precision dramatically
see example

MariaDB [test]> create table t(a bigint) engine columnstore;
Query OK, 0 rows affected (0.344 sec)
 
MariaDB [test]> insert into t value (80000000001);
Query OK, 1 row affected (0.144 sec)
 
MariaDB [test]> insert into t value (80000000003);
Query OK, 1 row affected (0.076 sec)
 
MariaDB [test]> select std(a) from t;
+--------+
| std(a) |
+--------+
| 0.0000 |
+--------+
1 row in set (0.059 sec)
 
MariaDB [test]> create table tm(a bigint);
Query OK, 0 rows affected (0.010 sec)
 
MariaDB [test]> insert into tm value (80000000003);
Query OK, 1 row affected (0.001 sec)
 
MariaDB [test]> insert into tm value (80000000001);
Query OK, 1 row affected (0.001 sec)
 
MariaDB [test]> select std(a) from tm;
+--------+
| std(a) |
+--------+
| 1.0000 |
+--------+
1 row in set (0.001 sec)}}

Same problem we have in twin window functions.
Because of this problem we have different behavior on ARM and x86, and broken tests also.
We should replace naive algorithm with more stable like server does

Also Columnstore functions work improperly with decimals

MariaDB [test]> select * from tdec;
+--------------------------------------+
| col1                                 |
+--------------------------------------+
| 800000000000000000000000000000000001 |
| 800000000000000000000000000000000003 |
| 800000000000000000000000000000000005 |
| 800000000000000000000000000000000007 |
| 800000000000000000000000000000000011 |
| 800000000000000000000000000000000013 |
| 800000000000000000000000000000000015 |
| 800000000000000000000000000000000017 |
| 800000000000000000000000000000000019 |
| 800000000000000000000000000000000021 |
+--------------------------------------+
10 rows in set (0.045 sec)
 
MariaDB [test]> SELECT 'q1', floor(STD(col1) OVER ()) AS std FROM tdec;
+----+------+
| q1 | std  |
+----+------+
| q1 | NULL |
| q1 | NULL |
| q1 | NULL |
| q1 | NULL |
| q1 | NULL |
| q1 | NULL |
| q1 | NULL |
| q1 | NULL |
| q1 | NULL |
| q1 | NULL |
+----+------+
10 rows in set (0.010 sec)

compare with server

MariaDB [test]> select * from tdecserv;
+--------------------------------------+
| col1                                 |
+--------------------------------------+
| 800000000000000000000000000000000001 |
| 800000000000000000000000000000000003 |
| 800000000000000000000000000000000005 |
| 800000000000000000000000000000000007 |
| 800000000000000000000000000000000011 |
| 800000000000000000000000000000000013 |
| 800000000000000000000000000000000015 |
| 800000000000000000000000000000000017 |
| 800000000000000000000000000000000019 |
| 800000000000000000000000000000000021 |
+--------------------------------------+
10 rows in set (0.000 sec)
 
MariaDB [test]> SELECT 'q1', floor(STD(col1) OVER ()) AS std FROM tdecserv;
+----+------+
| q1 | std  |
+----+------+
| q1 |    0 |
| q1 |    0 |
| q1 |    0 |
| q1 |    0 |
| q1 |    0 |
| q1 |    0 |
| q1 |    0 |
| q1 |    0 |
| q1 |    0 |
| q1 |    0 |
+----+------+
10 rows in set (0.001 sec)

All of this should be fixed

See
Algorythms


Generated at Thu Feb 08 02:55:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.