[MCOL-444] split character import issue Created: 2016-12-06  Updated: 2020-08-25  Resolved: 2017-12-06

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.0.5
Fix Version/s: 1.0.12, 1.1.3

Type: New Feature Priority: Minor
Reporter: David Hill (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Sprint: 2017-24

 Description   

select name from flights where id=250908;
----------------------------------------------------------------------------------------------------

name

----------------------------------------------------------------------------------------------------

Retargeting - Internal Run of Site - Searches that contain at least one of the following keywords

----------------------------------------------------------------------------------------------------
1 row in set, 1 warning (0.03 sec)

imysql-rea> select substr(name,1,100) from flights where id=250908;
--------------------

substr(name,1,100)

--------------------

 

--------------------
1 row in set (0.02 sec)

show warnings;
--------------------------------------------------------------------------

Level Code Message

--------------------------------------------------------------------------

Warning 1366 Incorrect string value: '\xE2\x80' for column 'name' at row 0

--------------------------------------------------------------------------
1 row in set (0.00 sec)

select length(name), char_length(name) from flights where id=250908;
-------------------------------+

length(name) char_length(name)

-------------------------------+

100 0

-------------------------------+

select hex(name) from flights where id=250908;
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

hex(name)

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

5265746172676574696E67202D20496E7465726E616C2052756E206F662053697465202D205365617263686573207468617420636F6E7461696E206174206C65617374206F6E65206F662074686520666F6C6C6F77696E67206B6579776F72647320E280

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In MySql db:
Retargeting - Internal Run of Site - Searches that contain at least one of the following keywords - "ocean", "water", "view" or "esplanade"

The source database is running MySQL 5.6.22
The data is dumped using mysqldump, sent to the InfiniDB server and imported with cpimport, which we allow to truncate strings that are too long.
The use of the substr function in the query should be superfluous as far as InfiniDB is concerned, as the column is 100 bytes, it makes the query results compatible with the MySQL servers.
That doesn't explain the char_length function results above or why InfiniDB is returning an empty string for the substr function.

I took the hex value and played with it on https://sites.google.com/site/nathanlexwww/tools/utf8-convert
The last bit, E280, appear to be a truncated multi-byte character from the original data. The '-' character.
the - code is E28093

Here is the hex of the original data from MySQL db:

5265746172676574696E67202D20496E7465726E616C2052756E206F662053697465202D205365617263686573207468617420636F6E7461696E206174206C65617374206F6E65206F662074686520666F6C6C6F77696E67206B6579776F72647320E2809320E2809C6F6365616EE2809D2C20E2809C7761746572E2809D2C20E2809C76696577E2809D206F7220E2809C6573706C616E61646522

So, cpimport is truncating bytes disregarding the number of bytes in the UTF-8 character. This is bad, need a way to fix these.
The substr and char_length functions appears to be choking if it is missing all the bits in a multi-byte character, at the end of the string anyway.

Was not sure how to categorize this, Character Set, Table Corruption or General Usage.

Making the export process of this table do the substr function up front, is probably the correct way to fix this properly. But that means doing it for all tables with varchar fields and would preclude using mysqldump, making the process much more tedious and take a lot longer.
The best solution for us is for cpimport to work better with multi-byte character sets (utf-8 for us).

A few thoughts.
1. Could be fixed in the InfiniDB engine to ignore or deal with incomplete multi-byte characters.
2. Add a switch to the cpimport command to put it in a multi-byte character safe mode.
3. If 2 slows down cpimport very much make a cpimport that is multi-byte character safe, cpimportMBS, to be used when one is importing data in a multi-byte character set. OR even auto detect it based on the data (more accurate), or based on the table or column definition (least accurate).

In our case, multi-byte characters only appear in the dimension tables, which are much smaller than the fact tables. The fact tables having only a date, int and bigint columns, where import speed is important.

I'm sure you are all well versed in UTF-8, but thought I would send this along anyway.

https://en.wikipedia.org/wiki/UTF-8

So you don't have to examine the whole string to determine if the last character is valid (complete).



 Comments   
Comment by David Hill (Inactive) [ 2016-12-06 ]

InfiniDB Issue #10684

Comment by Andrew Hutchings (Inactive) [ 2017-11-21 ]

Can you please let me know what parts are this bug? Also the "show create table" for flights?

With a suitable schema the only issue I can see at the moment is char_length() and substr() don't appear to work for UTF8.

For the cpimport truncation problem, I'm assuming this is due to a schema that doesn't have a column wide enough? If so this is non-trivial to fix. We need to modify the system catalog to support UTF8 and then add full UTF8 support to ColumnStore. This becomes a major feature (a very large piece of work) and should probably be separate.

Comment by Andrew Hutchings (Inactive) [ 2017-11-28 ]

So, at first I thought this was a problem with char_length() and substr() internally since I couldn't get it to work when the column size was in fact wide enough. It turns out this was due to me not setting SystemLang in Columnstore.xml.

The problem is two-fold:

1. cpimport truncates at the byte level instead of the character level which means for one multi-byte character only part of the character is stored
and
2. we use mbstowcs() internally to do the multibyte character counting in char_length() and substr() which will return a zero length if any part of the data is invalid (such as a truncation in problem 1)

The first problem is solvable by having cpimport use the multibyte string length functions in funcexp to calculate where to truncate. The second problem requires a new multibyte handling library.

For this bug I will attempt to solve #1 but #2 is a much bigger feature that is already covered by other tickets such as MCOL-337.

Comment by Andrew Hutchings (Inactive) [ 2017-11-29 ]

Patch makes cpimport find the correct place to truncate UTF8 data when required. This is a short to medium term hotfix until we can support the full set of MariaDB charsets.

Patch is for 1.0, will merge up through 1.1 later.

For QA:
Difficult to explain the test as Jira doesn't handle UTF8 data well. First you need to enable UTF8 (as described in our docs, the Columnstore.xml change is the most important). Then you need to create a table with a varchar(100) column and cpimport the UTF8 data represented by this hex:
5265746172676574696E67202D20496E7465726E616C2052756E206F662053697465202D205365617263686573207468617420636F6E7461696E206174206C65617374206F6E65206F662074686520666F6C6C6F77696E67206B6579776F72647320E2809320E2809C6F6365616EE2809D2C20E2809C7761746572E2809D2C20E2809C76696577E2809D206F7220E2809C6573706C616E61646522

You can convert this to UTF8 text to copy/paste into a file using:
https://sites.google.com/site/nathanlexwww/tools/utf8-convert

You then need to run char_length() on it and you should see a length value rather than 0 before the patch.

Comment by Daniel Lee (Inactive) [ 2017-12-06 ]

Builds verified: GitHub source

1.0.12-1

[root@localhost ~]# cat mariadb-columnstore-1.0.12-1-centos7.x86_64.bin.tar.txt
/root/columnstore/mariadb-columnstore-server
commit 25e9d054cd3d05683fade1b974e1730316d256ed
Merge: 89b2ea1 7c52a83
Author: David.Hall <david.hall@mariadb.com>
Date: Tue Nov 21 10:49:11 2017 -0600

Merge pull request #79 from mariadb-corporation/MCOL-954-1.0

MCOL-954 Init vtable state

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit b112e826a2793228f5f3c1312fec5291fc1d8bf5
Merge: 7c2640f b657938
Author: David.Hall <david.hall@mariadb.com>
Date: Fri Dec 1 16:17:28 2017 -0600

Merge pull request #338 from mariadb-corporation/MCOL-1068

MCOL-1068 Improve compression_ratio() procedure

1.1.3-1

/root/columnstore/mariadb-columnstore-server
commit 632e265687674fb66bd1d704bc18032b00dd6b17
Merge: 5e9fe52 200f5be
Author: david hill <david.hill@mariadb.com>
Date: Tue Nov 21 15:22:06 2017 -0600

Merge branch 'develop-1.1' of https://github.com/mariadb-corporation/mariadb-columnstore-server into develop-1.1

/root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine
commit 4d8026618cfb5377c9a200170848092ce5660f10
Author: david hill <david.hill@mariadb.com>
Date: Wed Nov 29 09:36:24 2017 -0600

change how the os_detect is run on remote nodes

Verified mentioned test case. Character string was truncated at 98, instead of 100.

Generated at Thu Feb 08 02:21:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.