Details
-
Bug
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
22.08.7
-
2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-10, 2023-11
Description
Consider the example:
CREATE TABLE empsalary (
|
depname varchar(100),
|
empno bigint,
|
salary int,
|
enroll_date date
|
)engine=columnstore;
|
 |
insert into empsalary values ('develop' , 10, 5200, '2007-08-01');
|
insert into empsalary values ('sales' , 1, 5000, '2006-10-01');
|
insert into empsalary values ('personnel', 5, 3500, '2007-12-10');
|
insert into empsalary values ('sales' , 4, 4800, '2007-08-08');
|
insert into empsalary values ('personnel', 2, 3900, '2006-12-23');
|
insert into empsalary values ('develop' , 7, 4200, '2008-01-01');
|
insert into empsalary values ('develop' , 9, 4500, '2008-01-01');
|
insert into empsalary values ('sales' , 3, 4800, '2007-08-01');
|
insert into empsalary values ('develop' , 8, 6000, '2006-10-01');
|
insert into empsalary values ('develop' , 11, 5200, '2007-08-15');
|
insert into empsalary values ('develop' , 12, null, '2008-08-09');
|
Here is the SELECT that differs b/w MDB and MCS. Here is MCS version.
MariaDB [test]> select avg(salary),depname, moda(salary) over(partition by depname order by enroll_date) from empsalary group by depname;
|
+-------------+-----------+--------------------------------------------------------------+
|
| avg(salary) | depname | moda(salary) over(partition by depname order by enroll_date) |
|
+-------------+-----------+--------------------------------------------------------------+
|
| 6000.0000 | develop | 6000 |
|
| 5200.0000 | develop | 5200 |
|
| 5200.0000 | develop | 5200 |
|
| 4200.0000 | develop | 5200 |
|
| 4500.0000 | develop | 5200 |
|
| NULL | develop | 5200 |
|
| 3900.0000 | personnel | 3900 |
|
| 3500.0000 | personnel | 3500 |
|
| 5000.0000 | sales | 5000 |
|
| 4800.0000 | sales | 4800 |
|
| 4800.0000 | sales | 4800 |
|
+-------------+-----------+--------------------------------------------------------------+
|
11 rows in set, 1 warning (0.080 sec)
|
Here is MDB.
MariaDB [test]> select avg(salary),depname, moda(salary) over(partition by depname order by enroll_date) from emp group by depname;
|
+-------------+-----------+--------------------------------------------------------------+
|
| avg(salary) | depname | moda(salary) over(partition by depname order by enroll_date) |
|
+-------------+-----------+--------------------------------------------------------------+
|
| 5020.0000 | develop | 5200 |
|
| 3700.0000 | personnel | 3500 |
|
| 4866.6667 | sales | 5000 |
|
+-------------+-----------+--------------------------------------------------------------+
|
3 rows in set (0.002 sec)
|
The origins of this incorrect behavior is the fact that RowStorage::hash gets an incorrect number of key columns to hash on thus MCS treats every row it receives from scanning in BPP::execute as a unique one.
The calculation of key columns number is buried deep in CSEP to JobList translation, namely in addProjectStepsToBps that is called inside combineJobStepsByTable. In simple words TAS uses the set of columns returned the previous delivery/projection step as TAS input. It removes all aggregate columns from the set and those that left are TAS key columns participate in hashing.
In case when WF is projection combineJobStepsByTable run by TAS sees aux columns added by WF so TAS hashing miscalculates every row hash making all rows potentially unique and this alters TAS result set.
Attachments
Issue Links
- relates to
-
MCOL-5187 OOM happening when querying large datasets and using distinct
- Closed