[MCOL-3503] Create MODA function for columnstore Created: 2019-09-16  Updated: 2022-02-28  Due: 2019-09-25  Resolved: 2019-11-18

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.4
Fix Version/s: 1.4.1

Type: New Feature Priority: Major
Reporter: David Hall (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
Epic Link: ColumnStore Compatibility Improvements
Sprint: 2019-06

 Description   

A new aggregate function MODA which takes the mode of the input numeric field and determines the mode. Tie breakers are: Closest to the AVG, then smallest absolute value.



 Comments   
Comment by David Hall (Inactive) [ 2019-09-25 ]

The Columnstore UDAF uses a template class to differentiate the types of input and allows for smallest type in the hash table.

The Server UDF uses the return type to dictate the datatype of the hash table. This does not allow for the smallest possible hash table. The Return Types are INT_RESULT, REAL_RESULT, DECIMAL_RESULT and STRING_RESULT, with DECIMAL_RESULT expected to return a c string. So we have int64_t, double, char*, char*. That's our options. There doesn't appear to be any way to determine the characteristics of the incoming column. For decimal, I chose to store as long double, as that seemed more efficient than trying to store the incoming decimal strings.

Comment by David Hall (Inactive) [ 2019-09-30 ]

There's a regression test pull request as well

Comment by Daniel Lee (Inactive) [ 2019-10-07 ]

Build tested: 1.4.1-1 source

Engine code:

[root@localhost engine]# git show
commit fe5d815a458e41f7e0dd53b1fc2f5245281da5b8
Merge: 51129d6 986e541
Author: David.Hall <david.hall@mariadb.com>
Date: Mon Oct 7 08:34:59 2019 -0500
Merge pull request #891 from LinuxJedi/fix-ubuntu18
Fix abs() usage for Ubuntu 18.04

The moda() should return 2.

I worked with Mr. Hall and there seems to be an issue here:

MariaDB [(none)]> select * from mysql.func where name = 'moda';
----------------------------------+

name ret dl type

----------------------------------+

moda 1 libregr_mysql.so aggregate

----------------------------------+

After I changed the value for ret to 4 and restarted the system, the same query would return the correct result.

Comment by David Hall (Inactive) [ 2019-10-07 ]

Changed return type to 4 (DECIMAL_RESULT) in install_calpont_mysql.sh

Comment by Daniel Lee (Inactive) [ 2019-10-08 ]

Build tested: 1.4.1-1 source

[root@localhost engine]# git show
commit 1af0248ae5946efad46e6bce6af761b75da014f7
Merge: ccb36e9 49d63de
Author: Andrew Hutchings <andrew@linuxjedi.co.uk>
Date: Tue Oct 8 08:08:41 2019 +0100

Merge pull request #896 from mariadb-corporation/updateGitIgnore

update .gitignore to reflect generated scripts

For unsigned integer and unsigned bigint, incorrect result is returned:

--------------
CREATE TABLE mcol3503 (C1 INTEGER UNSIGNED) ENGINE=COLUMNSTORE
--------------

Query OK, 0 rows affected (0.168 sec)

--------------
INSERT mcol3503 VALUES (1),(2),(2),(2),(3),(4),(8),(8),(8)
--------------

Query OK, 9 rows affected (0.580 sec)
Records: 9 Duplicates: 0 Warnings: 0

--------------
SELECT AVG(C1) FROM mcol3503
--------------

---------

AVG(C1)

---------

4.2222

---------
1 row in set (0.134 sec)

--------------
SELECT MODA(C1) FROM mcol3503
--------------

----------

MODA(C1)

----------

8

----------
1 row in set (0.016 sec)

--------------
CREATE TABLE mcol3503 (C1 BIGINT UNSIGNED) ENGINE=COLUMNSTORE
--------------

Query OK, 0 rows affected (0.172 sec)

--------------
INSERT mcol3503 VALUES (1),(2),(2),(2),(3),(4),(8),(8),(8)
--------------

Query OK, 9 rows affected (0.890 sec)
Records: 9 Duplicates: 0 Warnings: 0

--------------
SELECT AVG(C1) FROM mcol3503
--------------

---------

AVG(C1)

---------

4.2222

---------
1 row in set (0.163 sec)

--------------
SELECT MODA(C1) FROM mcol3503
--------------

----------

MODA(C1)

----------

8

----------
1 row in set (0.016 sec)

Both queries should return 2 instead of 8.

Also verify moda() with subquery, functions, and having clause.

Comment by Daniel Lee (Inactive) [ 2019-10-09 ]

The following test case also failed, for both SIGN and UNSIGNED data types:
TINYINT, SMALLINT, INTEGER, BIGINT, DECIMAL(4), DECIMAL(18)
but DECIMAL(4,2) did return 6.00 as expected.

Does it have something to do with the failed data types being integers?

Example for TINYINT. A return value of 6 is expected
--------------
TRUNCATE TABLE mcol3503
--------------

Query OK, 0 rows affected (0.247 sec)

--------------
INSERT mcol3503 VALUES (1),(2),(3),(4),(5),(6),(7),(8),(15)
--------------

Query OK, 9 rows affected (0.633 sec)
Records: 9 Duplicates: 0 Warnings: 0

--------------
SELECT AVG(C1) FROM mcol3503
--------------

---------

AVG(C1)

---------

5.6667

---------
1 row in set (0.140 sec)

--------------
SELECT 'TINYINT', MODA(C1), IF(MODA(C1)=6,'Passed', 'Failed') AS STATUS FROM mcol3503
--------------

-----------------------

TINYINT MODA(C1) STATUS

-----------------------

TINYINT 5 Failed

-----------------------
1 row in set (0.018 sec)

Comment by David Hall (Inactive) [ 2019-10-11 ]

For unsigned, the issue is this:
avg = 4 (truncated from 4.22 because of integer arithmatic)
value = 2. Calculating distance: the calculation is abs(value - avg) . Because it's unsigned and value < avg, we get value-avg = 4294967294.

The second issue with the return value being 5 rather than 6 is indeed due to INTEGER arithmetic. AVG is being truncated before the compare, so the AVG 5.6667 is turned into 5.

Comment by David Hall (Inactive) [ 2019-10-11 ]

The fix is to use long double arithmetic for tie breaking regardless of the column data type. Long double gives us the full unsigned int precision for integers (on most platforms), while double would have more rounding error for large values.

Comment by Daniel Lee (Inactive) [ 2019-10-30 ]

Build tested: 1.4.1-1

[dlee@master centos7]$ cat gitversionInfo.txt
engine commit:
3e7a964

Encountered the following issue. Investigated on the issue but have not figured out the cause.

MariaDB [mytest]> select moda(c1) from mcol3505;
ERROR 1146 (42S02): Table 'mytest.mcol3505' doesn't exist
MariaDB [mytest]> select moda(c1) from mcol3503;
ERROR 1305 (42000): FUNCTION mytest.moda does not exist
MariaDB [mytest]> CREATE AGGREGATE FUNCTION moda returns DECIMAL soname 'libregr_mysql.so';
ERROR 1125 (HY000): Function 'moda' already exists

Comment by Daniel Lee (Inactive) [ 2019-11-18 ]

Build verfied: 1.4.1-1 github source

Server
commit 77a245fe5658b8d6d937620586ecd802b3432a78
Author: Marko Mäkelä <marko.makela@mariadb.com>
Date: Sun Nov 17 20:04:11 2019 +0200

Engine
commit 7c6a086cfb54b8bbd500efb41f34c9fa1ed03ca1
Merge: f291d88 1a94d53
Author: Roman Nozdrin <drrtuy@gmail.com>
Date: Mon Nov 18 12:11:30 2019 +0300

Make a build from source and verified test case

Generated at Thu Feb 08 02:43:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.