[MCOL-4727] Queries which use GROUP BY with ORDER BY HEX return more rows than the rules of character identity expects in some collations. Created: 2021-05-18  Updated: 2022-02-18

Status: Open
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: 5.5.1
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Alexander Barkov Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None


 Description   

I run this script:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (c1 VARCHAR(20) CHARACTER SET latin1 COLLATE latin1_swedish_ci) ENGINE=ColumnStore;
INSERT INTO t1 VALUES ('a'),('A'),('ä'),('Ä'),('ã'),('Ã');
SELECT c1, COUNT(*) FROM t1 GROUP BY c1;

+------+----------+
| c1   | COUNT(*) |
+------+----------+
| ä    |        2 |
| a    |        4 |
+------+----------+

SELECT DISTINCT c1 FROM t1;

+------+
| c1   |
+------+
| a    |
| ä    |
+------+

Looks good so far. In the given collation:

  • A WITH TILDE ABOVE is equal to accent-less A
  • A WITH DIAERESIS is a separate letter (sorted in the end of the Swedish alphabet after Z)

If I do not use HEX in ORDER BY - everything is fine also.

MariaDB [gjd]> SELECT c1, COUNT(*) FROM t1 GROUP BY c1 ORDER BY c1;
+------+----------+
| c1   | COUNT(*) |
+------+----------+
| a    |        4 |
| ä    |        2 |
+------+----------+

Things go wrong if do ORDER BY HEX(c1):

SELECT c1, COUNT(*) FROM t1 GROUP BY c1 ORDER BY HEX(c1);

+------+----------+
| c1   | COUNT(*) |
+------+----------+
| A    |        1 |
| a    |        1 |
| Ã    |        1 |
| Ä    |        1 |
| ã    |        1 |
| ä    |        1 |
+------+----------+

SELECT DISTINCT c1 FROM t1 ORDER BY HEX(c1);

+------+
| c1   |
+------+
| A    |
| a    |
| Ã    |
| Ä    |
| ã    |
| ä    |
+------+

Notice, the numbers of records in the result sets have changed. These two queries are expected to return two rows each.



 Comments   
Comment by David Hall (Inactive) [ 2021-05-21 ]

The problem occurs when you give an ORDER BY that conflicts with the aggregate. It depends on how the engine interprets it. Should it run the query, get a result set, then look at the function in the ORDER BY? This appears to be what InnoDB does.
Or should it run all the functions to get the result set, then apply the ORDER BY. This appears to be what Columnstore does.

SELECT c1, COUNT(*) FROM t1 GROUP BY c1 ORDER BY HEX(c1);

So in the first case, we get two rows that HEX is applied to.
In the second, HEX is applied during row accumulation, thus resulting in unique values for each row in the table.

The problem is exacerbated by the fact that these functions are run in PrimProc. Order By is run in ExeMgr. This is why the function is sent as part of the query to PrimProc. We may need to re-engineer a bunch of stuff such that functions that are part of Order By are not sent to PrimProc, but rather just ask for the value of the columns in the function. Then run the function in ExeMgr as part of the Order By By process.

Some additional thought needs to be put into this, since ORDER BY can contain multiple functions, nested functions, and functions on columns that are not part of the query's result set. How do you handle

SELECT c1, COUNT(*) FROM t1 GROUP BY c1 ORDER BY HEX(c2);

Other stange combinations can be found, I'm sure. Each will have to be tested in InnoDB to see what behavior is needed.

Comment by David Hall (Inactive) [ 2021-05-21 ]

Answer to the last example (InnoDB)

SELECT c1, COUNT(*) FROM t1 GROUP BY c1 ORDER BY HEX(c2);
+------+----------+
| c1   | COUNT(*) |
+------+----------+
| a    |        4 |
| ä    |        2 |
+------+----------+

Even though, in this example, c2 had six unique values, the query still managed to get only two rows. It's obvious it chose the value for the first row in each group.

Comment by Alexander Barkov [ 2021-05-21 ]

A workaround is to use this query with ColumnStore:

SELECT c1, COUNT(*) FROM t1 GROUP BY c1 ORDER BY MIN(HEX(c1));

+------+----------+
| c1   | COUNT(*) |
+------+----------+
| a    |        4 |
| ä    |        2 |
+------+----------+

Generated at Thu Feb 08 02:52:31 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.