[MCOL-735] String Manipulation Functions Either Returns Incorrect Resultsets or Cause Internal Error When Applied To Japanese Data Created: 2017-06-01  Updated: 2017-07-27  Resolved: 2017-07-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: None
Fix Version/s: Icebox

Type: Bug Priority: Minor
Reporter: Yuichiro Kosila Assignee: Andrew Hutchings (Inactive)
Resolution: Not a Bug Votes: 0
Labels: community
Environment:

Amazon EC2 , MariaDB ColumnStore AMI



 Description   

(1) USASCII data seems not affected
(2) Phenonemon seem when Japanese data are included
(3) I have not tested on non-Japanese UTF-8 data yet
(4) I confirmed that works correctly and no problem seen when InnoDB is selected
(5) I have not tested all string manipulation functions yet , for I cannot find the list of available functions supported in ColumnStore mode in the website
(6) Currently, avoid using functions, or not storing non-ASCII data when ColumnStore engine is selected

/* Test Code (Begin) */
CREATE DATABASE IF NOT EXISTS `mytest` DEFAULT CHARSET=utf8 DEFAULT COLLATE=utf8_general_ci ;
USE `mytest` ; COMMIT ;
CREATE TABLE `mytable` ( `col1` CHAR(99) , `col2` VARCHAR(99) ) Engine=ColumnStore DEFAULT CHARSET=utf8 ;
COMMIT ;
INSERT INTO `mytable` ( `col1` , `col2` ) VALUES ( 'Aaa' , 'Bbb' ) ;
INSERT INTO `mytable` ( `col1` , `col2` ) VALUES ( 'Ccc' , 'Ddd' ) ;
INSERT INTO `mytable` ( `col1` , `col2` ) VALUES ( ' A quick brown FOX ! ' , ' Jumped over the lazy DOG ? ' ) ;
INSERT INTO `mytable` ( `col1` , `col2` ) VALUES ( 'テスト' , 'テスト' ) ;
INSERT INTO `mytable` ( `col1` , `col2` ) VALUES ( 'マリアの実験' , 'MariaDB_Test' ) ;
INSERT INTO `mytable` ( `col1` , `col2` ) VALUES ( ' 半角空白囲み ' , ' 全角空白囲み ' ) ;
COMMIT ;
SELECT `col1` , `col2` FROM `mytable` ; /* Trailing half-width whitespaces got lost ? */
SELECT `col1` , TRIM(`col1`) , `col2` , TRIM(`col2`) FROM `mytable` ; /* Internal Error */
SELECT `col1` , UPPER(`col1`) , `col2` , LOWER(`col2`) FROM `mytable` ; /* No error but incorrect result however the functions may not support full-width case conversion */
SELECT `col1` , LENGTH(`col1`) , `col2` , LENGTH(`col2`) FROM `mytable` ; /* Not char count, rather consumed bytes ? */
SELECT `col1` , REVERSE(`col1`) , `col2` , REVERSE(`col2`) FROM `mytable` ; /* Trailing blank lost in the result */
SELECT `col1` , SUBSTRING(`col1` , 2 , 2 ) , `col2` , SUBSTRING(`col2` , 2 , 2 ) FROM `mytable` ; /* No error but incorrect result */
COMMIT ;
DROP DATABASE IF EXISTS `mytest` ; COMMIT ;
USE `mysql` ; COMMIT ;
/* Test Code (End) */



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2017-06-01 ]

Hi Yuichiro,

Whilst this won't solve all your issues did you set the system locale in Columnstore.xml as indicated near the end of this documentation?:

https://mariadb.com/kb/en/mariadb/mariadb-columnstore-system-usage/

It may require a restart of ColumnStore afterwards but should at least resolve the TRIM, UPPER and SUBSTRING. LENGTH is known to not work for UTF-8 but I'm not sure if this is documented, we can resolve this as part of this ticket. We do intend to improve our multi-byte handling in future versions, especially around collations and getting the current information from MariaDB Server about the session.

Comment by Yuichiro Kosila [ 2017-07-25 ]

First of all, my deep apologies being silent from this issue for a long time.
Thanks to Andrew-san and David-san, I've successfully confirmed that on
ColumnStore Version 1.0.9GA , Japanese strings are correctly handled, if ;

1)
sudo localedef -i en_US -f UTF-8 en_US.UTF-8
is set prior to proceeding into CS installation.

2)
<SystemLang>C</SystemLang>
is modified to
<SytemLang>en_US.utf8</SystemLang>
in
$CS/etc/Columnstore.xml

3)
Modify $CS/mysql/my.cnf
to add the following one line in the [client] section
default-character-set = utf8

4)
Modify $CS/mysql/my.cnf
to add the following three lines in the [mysqld] section
character-set-server = utf8
collation-server = utf8_general_ci
init-connect = 'SET NAMES utf8'

Therefore, this incident is not even a bug, solely the lack of my preparation....
My apology.

LENGTH() function gives back the length in unit of bytes, but since
CHAR_LENGTH() function returns the number of characters, and it correctly counts
even Japanese glyphs as well,. so there's no showstopper I see at this moment.

Once again, thank you for your valueable inputs, and let us close this issue.
Thank you.
Yuichiro Kosila

Comment by Andrew Hutchings (Inactive) [ 2017-07-27 ]

Based on previous comment, closing as "Not a Bug". Many thanks for your input.

Generated at Thu Feb 08 02:23:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.