[MCOL-444] split character import issue Created: 2016-12-06 Updated: 2020-08-25 Resolved: 2017-12-06 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 1.0.5 |
| Fix Version/s: | 1.0.12, 1.1.3 |
| Type: | New Feature | Priority: | Minor |
| Reporter: | David Hill (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Sprint: | 2017-24 |
| Description |
|
select name from flights where id=250908;
----------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------- imysql-rea> select substr(name,1,100) from flights where id=250908;
-------------------- -------------------- show warnings;
--------
-------- select length(name), char_length(name) from flights where id=250908;
-------------
------------- select hex(name) from flights where id=250908;
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- In MySql db: The source database is running MySQL 5.6.22 I took the hex value and played with it on https://sites.google.com/site/nathanlexwww/tools/utf8-convert Here is the hex of the original data from MySQL db:
So, cpimport is truncating bytes disregarding the number of bytes in the UTF-8 character. This is bad, need a way to fix these. Was not sure how to categorize this, Character Set, Table Corruption or General Usage. Making the export process of this table do the substr function up front, is probably the correct way to fix this properly. But that means doing it for all tables with varchar fields and would preclude using mysqldump, making the process much more tedious and take a lot longer. A few thoughts. In our case, multi-byte characters only appear in the dimension tables, which are much smaller than the fact tables. The fact tables having only a date, int and bigint columns, where import speed is important. I'm sure you are all well versed in UTF-8, but thought I would send this along anyway. https://en.wikipedia.org/wiki/UTF-8 So you don't have to examine the whole string to determine if the last character is valid (complete). |
| Comments |
| Comment by David Hill (Inactive) [ 2016-12-06 ] |
|
InfiniDB Issue #10684 |
| Comment by Andrew Hutchings (Inactive) [ 2017-11-21 ] |
|
Can you please let me know what parts are this bug? Also the "show create table" for flights? With a suitable schema the only issue I can see at the moment is char_length() and substr() don't appear to work for UTF8. For the cpimport truncation problem, I'm assuming this is due to a schema that doesn't have a column wide enough? If so this is non-trivial to fix. We need to modify the system catalog to support UTF8 and then add full UTF8 support to ColumnStore. This becomes a major feature (a very large piece of work) and should probably be separate. |
| Comment by Andrew Hutchings (Inactive) [ 2017-11-28 ] |
|
So, at first I thought this was a problem with char_length() and substr() internally since I couldn't get it to work when the column size was in fact wide enough. It turns out this was due to me not setting SystemLang in Columnstore.xml. The problem is two-fold: 1. cpimport truncates at the byte level instead of the character level which means for one multi-byte character only part of the character is stored The first problem is solvable by having cpimport use the multibyte string length functions in funcexp to calculate where to truncate. The second problem requires a new multibyte handling library. For this bug I will attempt to solve #1 but #2 is a much bigger feature that is already covered by other tickets such as |
| Comment by Andrew Hutchings (Inactive) [ 2017-11-29 ] |
|
Patch makes cpimport find the correct place to truncate UTF8 data when required. This is a short to medium term hotfix until we can support the full set of MariaDB charsets. Patch is for 1.0, will merge up through 1.1 later. For QA: You can convert this to UTF8 text to copy/paste into a file using: You then need to run char_length() on it and you should see a length value rather than 0 before the patch. |
| Comment by Daniel Lee (Inactive) [ 2017-12-06 ] |
|
Builds verified: GitHub source 1.0.12-1 [root@localhost ~]# cat mariadb-columnstore-1.0.12-1-centos7.x86_64.bin.tar.txt Merge pull request #79 from mariadb-corporation/ /root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine Merge pull request #338 from mariadb-corporation/ 1.1.3-1 /root/columnstore/mariadb-columnstore-server Merge branch 'develop-1.1' of https://github.com/mariadb-corporation/mariadb-columnstore-server into develop-1.1 /root/columnstore/mariadb-columnstore-server/mariadb-columnstore-engine change how the os_detect is run on remote nodes Verified mentioned test case. Character string was truncated at 98, instead of 100. |