[MCOL-3536] Order by with UTF Created: 2019-10-01  Updated: 2021-03-19  Resolved: 2020-06-24

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.2, 1.4
Fix Version/s: 1.5.2

Type: Bug Priority: Major
Reporter: David Hall (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
is blocked by MDEV-21504 Collation: Create shared library for ... Closed
Issue split
split to MCOL-4064 Make JOIN collation aware Closed
split to MCOL-4065 DISTINCT is case sensitive Closed
split to MCOL-4085 ALTER TABLE MODIFY COLUMN COLLATE syn... Closed
split to MCOL-4099 Function Insert not inserting. Closed
split to MCOL-4100 functions LOCATE, INSTR, STRCMP and F... Closed
Problem/Incident
causes MCOL-4620 Function within an aggregate can caus... Closed
is caused by MCOL-2178 Run CS with vanilla 10.4 Closed
Relates
relates to MCOL-2221 reverse function not work properly wi... Closed
relates to MCOL-4498 LIKE is not collation aware Closed
Sprint: 2019-06, 2020-1, 2020-2, 2020-3, 2020-4, 2020-5, 2020-6, 2020-7

 Description   

We now use the internal ORDER BY of columnstore rather than relying on Server to do the ORDER BY. However, our current collation doesn't allow for anything other than Latin-1.

A means of getting the collation type from the table and passing it to the collation step must be created. Then the sort itself must be made to use it. In addition, the WINDOW FUNCTION ORDER BY must use this same collation.



 Comments   
Comment by Roman [ 2019-10-17 ]

Plz note that native CS sorting used with subqueries also so this bug also affects all prior versions including 1.2.

Comment by Andrew Hutchings (Inactive) [ 2019-10-18 ]

Merged into 1.2.6

Comment by David Hall (Inactive) [ 2019-11-11 ]

This is on hold until we decide what path to use for collation support – ICU or import the MariaDB code.

Comment by Roman [ 2019-11-11 ]

MDB team is ready to make a library out of their encoding subsystem.

Comment by David Hall (Inactive) [ 2020-04-23 ]

This breaks working_tpch1/qa_fe_cnxFunctions/unix_timestamp.sql. The ref file has been modified for now. It needs to be fixed when this fix is in.

Comment by Gagan Goel (Inactive) [ 2020-04-23 ]

This also breaks working_tpch1_compareLogOnly/distinctAggregationAndGroupBy/distinctorders.sql. Update this ref file when the fix is in, it is currently temporarily changed to get the test to pass.

Comment by Roman [ 2020-06-11 ]

David.HallCould you suggest the methods to test the supported features subset for our QA?

Comment by David Hall (Inactive) [ 2020-06-15 ]

QA: This affects ORDER BY with char types. ORDER BY will now use the collation of the ORDER BY column(s). All character setsd default to case insensitive. For each case insensitive collation, there is generally a case sensitive collation available.

For DLee, you may want to try Mandarin: COLLATE "gb2312_chinese_ci", "gbk_chinese_ci" or "big5_chinese_ci". For case sensitive you could try "gb2312_bin", "gbk_bin" or "big5_bin".

This change also affects many of the functions that work with strings. Here, capital and lower case values should compare as if they were equal (for case insensitive collations).

These were added Monday June 15, so you may want to wait for the patch to be merged.
substring
substring_index
replace

These were in the Friday June 12 build
between
case
cast
char (1)
char_length
find_in_set
greatest
in
insert
least
left (1)
locate / instr
lpad (2)
lower / lcase (4)
reverse
right (3)
rpad (2)
strcmp
trim (2)
upper / ucase (4)

(1) Code was changed, so it needs to be tested, but unless you know of a language that doesn't us 1,2,3 etc., functionality hasn't changed
(2) Code was changed, so it needs to be tested, but functionality hasn't changed. It should work better with glyph characters, as it should find the end of strings better.
(3) Functionality remains the same, but should be more accurate with complex utf characters.
(4) Should work much better with complex utf characters.

Joins are not yet case insensitive and still work with a binary compare. See MCOL-4064.
DISTINCT is not yet case insensitive and still work with a binary compare. See MCOL-4065.
WHERE may or may not be optimized to use JOIN, so may or may not work.
Alter table to change charset or collate doesn not work.

Order and compare functionality is controlled by the prevailing character set/collation of each column involved. If comparing columns with different collations, we are not as smart as Server at choosing the best collation to use, so there may be some differences in this very rare edge case.

Comment by Daniel Lee (Inactive) [ 2020-06-23 ]

Build tested: 1.5.2-1 (community edition, b33685)

Did the first round of testing and the following is the preliminary results

DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;

Tested:

LDI without cpimport
LDI with cpimpmort *1
cpimport
Select
Insert
Update
Select with filter *2
Update with filter
insert..select (columnstore to columnstore)
insert..select (columnstore to InnoDB)
insert..select (InnoDB to columnstore)
#
order by
join (innodb to innodb), case insensitive
(columnstore to innodb), case sensitive *3
(innodb to columnstore), case sensitive
(columnstore to columnstore), case sensitive

self join
cross engine join

substring
substring_index
replace
between
case case insensitive, the follow queries returned same results
select english, chinese, case when english='MARTYR' then '烈士' end from chinesecol;
select english, chinese, case when english='Martyr' then '烈士' end from chinesecol;
cast
char (1) - could not find a good way to test this with 2-byte chinese characters
char_length
find_in_set
greatest
in
If there is only 1 values in the string list, the 'in' function returns an error.
It works if there is more than one values. This is not related to collation.
MariaDB [mytest]> select english, chinese, chinese in ('修正主義者') from chinesecol order by chinese;
ERROR 1178 (42000): The storage engine for the table doesn't support IDB-1001: Function '=' isn't supported.

insert
string did not get inserted. *4

select english, chinese, insert(chinese,1,2, '哈羅') from chinesecol;
columnstore returns:

Revisionist 修正主義者 修正主義者
Saddle
Savagely 野蠻地 野蠻地
Shrewd 精明 精明
Strife 衝突 衝突
Sway 搖擺 搖擺
Truce 休戰 休戰
Weary 厭倦 厭倦

----------------------------------------------------------

innodb returns:

REVISIONIST 修正主義者 哈羅主義者
SADDLE 哈羅
SAVAGELY 野蠻地 哈羅地
SHREWD 精明 哈羅
STRIFE 衝突 哈羅
SWAY 搖擺 哈羅
TRUCE 休戰 哈羅
WEARY 厭倦 哈羅

----------------------------------------------------------

least
left (1)
locate / instr
returned byte postion, not character position *
lpad (2)
lower / lcase (4)
reverse
right (3)
rpad (2)
strcmp *6
compare case sensitive
trim (2)
upper / ucase (4)

issues:

1. LDI with cpimport load UTF8mb4 characters as NULL
2. ColumnStore is case sensitive on filters. InnoDB is not
MariaDB [mytest]> select * from chinesecol where english='WEARY';
Empty set (0.023 sec)
MariaDB [mytest]> select * from chinesecol where english='Weary';
----------------+

english chinese

----------------+

Weary 厭倦

----------------+
3. Cloumnstore join is case senstive, Innodb join is not
4. insert() did not insert characters
5. locate(), instr() returned byte position, not character position
6. strcmp is case sensitive

Comment by David Hall (Inactive) [ 2020-06-24 ]

1. LDI is being looked at. It is a separate issue from this JIRA.
2. This is a known limitation and will be fixed with MCOL-4064
3. This is a known limitation and will be fixed with MCOL-4064
4. A bug. MCOL-4099 Insert works unless you try to insert at position 1.
5. A bug. MCOL-4100
6. A bug. MCOL-4100

Comment by Daniel Lee (Inactive) [ 2020-06-24 ]

The requested feature has been implemented. Known and new issues are being tracked by respective individual tickets. Closing this ticket.

Generated at Thu Feb 08 02:43:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.