[MCOL-4103] Columnstore LDI doesn't support charsets, particularly utf8mb4 Created: 2020-06-24  Updated: 2020-12-14  Resolved: 2020-12-14

Status: Closed
Project: MariaDB ColumnStore
Component/s: MDB Plugin
Affects Version/s: 1.5.2
Fix Version/s: 5.5.1

Type: Bug Priority: Critical
Reporter: David Hall (Inactive) Assignee: David Hall (Inactive)
Resolution: Fixed Votes: 1
Labels: None

Issue Links:
Problem/Incident
is caused by MCOL-2000 varchar specified sizing is not in ch... Closed
Relates
relates to MCOL-4275 COLLATE Trouble with Columnstore Closed

 Description   

create table chineseCS (english varchar(20), chinese varchar(20)) engine=innodb DEFAULT CHARACTER SET utf8mb4 collate utf8mb4_unicode_520_ci engine=columnstore;

load data infile "/home/calpont/t.tbl" into table chineseCS character set utf8mb4 fields terminated by "|";

select * from chineseCS;
------------------+

english chinese

------------------+

Abhorrent NULL
Adamant NULL
Agony NULL
Antsy NULL
Appall NULL

------------------+
5 rows in set (0.146 sec)

But using traditional insert, it works:

truncate table chineseCS;
insert into chineseCS values ("Abhorrent","可惡的"),("Adamant","精金"),("Agony","痛苦"),("Antsy","螞蟻"),("Appall","驚恐");

select * from chineseCS;
--------------------+

english chinese

--------------------+

Abhorrent 可惡的
Adamant 精金
Agony 痛苦
Antsy 螞蟻
Appall 驚恐

--------------------+
5 rows in set (0.157 sec)



 Comments   
Comment by David Hall (Inactive) [ 2020-08-18 ]

We need to get cpimport to support this before LDI can.

Comment by Todd Stoffel (Inactive) [ 2020-08-25 ]

Officially MariaDB Columnstore only supports UTF8. ALL of our documentation and requirement guides mention this. This is more an enhancement than a bug.

MariaDB [test]> CREATE TABLE `lineorder`
(
	`LO_OrderKey` BIGINT(20) NOT NULL,
	`LO_LineNumber` TINYINT(4) NOT NULL,
	`LO_CustKey` INT(11) NOT NULL,
	`LO_PartKey` INT(11) NOT NULL,
	`LO_SuppKey` INT(11) NOT NULL,
	`LO_OrderDateKey` INT(11) NOT NULL,
	`LO_OrderPriority` VARCHAR(15) DEFAULT NULL,
	`LO_ShipPriority` CHAR(1) DEFAULT NULL,
	`LO_Quantity` TINYINT(4) DEFAULT NULL,
	`LO_ExtendedPrice` INT(11) DEFAULT NULL,
	`LO_OrdTotalPrice` INT(11) DEFAULT NULL,
	`LO_Discount` INT(11) DEFAULT NULL,
	`LO_Revenue` INT(11) DEFAULT NULL,
	`LO_SupplyCost` INT(11) DEFAULT NULL,
	`LO_Tax` TINYINT(4) DEFAULT NULL,
	`LO_CommitDateKey` INT(11) NOT NULL,
	`LO_ShipMode` VARCHAR(10) DEFAULT NULL
)
ENGINE = Columnstore DEFAULT CHARSET = utf8;
QUERY OK, 0 ROWS affected (0.120 sec)
 
MariaDB [test]> LOAD DATA INFILE '/tmp/test.csv'  INTO TABLE lineorder FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"';
QUERY OK, 10 ROWS affected (1.247 sec)
Records: 10  Deleted: 0  Skipped: 0  WARNINGS: 0
 
MariaDB [test]> SELECT * FROM lineorder WHERE LO_OrderKey =1 AND LO_LineNumber =1 \G
*************************** 1. ROW ***************************
     LO_OrderKey: 1
   LO_LineNumber: 1
      LO_CustKey: 221401
      LO_PartKey: 620758
      LO_SuppKey: 24819
 LO_OrderDateKey: 19960102
LO_OrderPriority: 5-LOW
 LO_ShipPriority: 0
     LO_Quantity: 17
LO_ExtendedPrice: 2853824
LO_OrdTotalPrice: 19179206
     LO_Discount: 4
      LO_Revenue: 2739671
   LO_SupplyCost: 100723
          LO_Tax: 2
LO_CommitDateKey: 19960212
     LO_ShipMode: TRUCK
1 ROW IN SET (0.381 sec)

Comment by Justin Swanhart [ 2020-08-25 ]

Well, if columnstore doesn't support anything other than UTF-8, tables with other character sets should not be able to be created. And if I have InnoDB data that I want to use with HTAP, and the InnoDB data has other character sets, how is that supposed to work. If an InnoDB table has a TINYINT UNSIGNED with values of 255, HTAP won't work either. So I think there is a real fundamental problem here.

Comment by Todd Stoffel (Inactive) [ 2020-08-25 ]

greenlion Please provide example source and target tables and we'll be happy to address your concerns.

Comment by Justin Swanhart [ 2020-08-25 ]

Note, MariaDB 10.5 has a default utf8mb4 character set. All I did was create the table and it used the default character set, a character set that is apparently not supported. That seems less than ideal to me, and it violates the policy of least surprise.

Comment by Justin Swanhart [ 2020-08-25 ]

create a table with TINYINT UNSIGNED:
create table oltp.innodb_table
(c1 tinyint unsigned)
engine=innodb;

replicate from Innodb -> columnstore for HTAP using rewrite rules.

The columnstore table should be:
create table olap.innodb_table
(c1 tinyint unsigned)
engine=columnstore;

insert into oltp.innodb_table values (255);

The values won't be able to be inserted into ColumnStore. 255 is out of range for TINYINT UNSIGNED on columnstore, but is valid for InnoDB.

MariaDB [test2]> insert into olap.innodb_table values(255);
ERROR 1264 (22003): CAL0001: IDB-2025: Data truncated for column 'c1'

Comment by Justin Swanhart [ 2020-08-25 ]

Note that yes, one could change TINYINT UNSIGNED to SMALLINT on CS - but that won't work for unsupported values for BIGINT.

Comment by Todd Stoffel (Inactive) [ 2020-08-25 ]

This is true, there are limitation to Columnstore field types and functions. We do not plan to make this a replacement for InnoDB. On certain occasions like you described above you might need to map one field type to another. Indexes would need to removed as well. These are known and documented differences between storage engines. We will try to close the gaps where appropriate (We have a few tickets related to this work MCOL-269, MCOL-641).

Comment by David Hall (Inactive) [ 2020-12-14 ]

This works now that MCOL-2000 is fixed

Generated at Thu Feb 08 02:47:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.