Uploaded image for project: 'MariaDB ColumnStore'
  1. MariaDB ColumnStore
  2. MCOL-444

split character import issue

    XMLWordPrintable

Details

    • New Feature
    • Status: Closed (View Workflow)
    • Minor
    • Resolution: Fixed
    • 1.0.5
    • 1.0.12, 1.1.3
    • None
    • None
    • 2017-24

    Description

      select name from flights where id=250908;
      ----------------------------------------------------------------------------------------------------

      name

      ----------------------------------------------------------------------------------------------------

      Retargeting - Internal Run of Site - Searches that contain at least one of the following keywords

      ----------------------------------------------------------------------------------------------------
      1 row in set, 1 warning (0.03 sec)

      imysql-rea> select substr(name,1,100) from flights where id=250908;
      --------------------

      substr(name,1,100)

      --------------------

       

      --------------------
      1 row in set (0.02 sec)

      show warnings;
      --------------------------------------------------------------------------

      Level Code Message

      --------------------------------------------------------------------------

      Warning 1366 Incorrect string value: '\xE2\x80' for column 'name' at row 0

      --------------------------------------------------------------------------
      1 row in set (0.00 sec)

      select length(name), char_length(name) from flights where id=250908;
      -------------------------------+

      length(name) char_length(name)

      -------------------------------+

      100 0

      -------------------------------+

      select hex(name) from flights where id=250908;
      ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      hex(name)

      ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      5265746172676574696E67202D20496E7465726E616C2052756E206F662053697465202D205365617263686573207468617420636F6E7461696E206174206C65617374206F6E65206F662074686520666F6C6C6F77696E67206B6579776F72647320E280

      ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

      In MySql db:
      Retargeting - Internal Run of Site - Searches that contain at least one of the following keywords - "ocean", "water", "view" or "esplanade"

      The source database is running MySQL 5.6.22
      The data is dumped using mysqldump, sent to the InfiniDB server and imported with cpimport, which we allow to truncate strings that are too long.
      The use of the substr function in the query should be superfluous as far as InfiniDB is concerned, as the column is 100 bytes, it makes the query results compatible with the MySQL servers.
      That doesn't explain the char_length function results above or why InfiniDB is returning an empty string for the substr function.

      I took the hex value and played with it on https://sites.google.com/site/nathanlexwww/tools/utf8-convert
      The last bit, E280, appear to be a truncated multi-byte character from the original data. The '-' character.
      the - code is E28093

      Here is the hex of the original data from MySQL db:

      5265746172676574696E67202D20496E7465726E616C2052756E206F662053697465202D205365617263686573207468617420636F6E7461696E206174206C65617374206F6E65206F662074686520666F6C6C6F77696E67206B6579776F72647320E2809320E2809C6F6365616EE2809D2C20E2809C7761746572E2809D2C20E2809C76696577E2809D206F7220E2809C6573706C616E61646522

      So, cpimport is truncating bytes disregarding the number of bytes in the UTF-8 character. This is bad, need a way to fix these.
      The substr and char_length functions appears to be choking if it is missing all the bits in a multi-byte character, at the end of the string anyway.

      Was not sure how to categorize this, Character Set, Table Corruption or General Usage.

      Making the export process of this table do the substr function up front, is probably the correct way to fix this properly. But that means doing it for all tables with varchar fields and would preclude using mysqldump, making the process much more tedious and take a lot longer.
      The best solution for us is for cpimport to work better with multi-byte character sets (utf-8 for us).

      A few thoughts.
      1. Could be fixed in the InfiniDB engine to ignore or deal with incomplete multi-byte characters.
      2. Add a switch to the cpimport command to put it in a multi-byte character safe mode.
      3. If 2 slows down cpimport very much make a cpimport that is multi-byte character safe, cpimportMBS, to be used when one is importing data in a multi-byte character set. OR even auto detect it based on the data (more accurate), or based on the table or column definition (least accurate).

      In our case, multi-byte characters only appear in the dimension tables, which are much smaller than the fact tables. The fact tables having only a date, int and bigint columns, where import speed is important.

      I'm sure you are all well versed in UTF-8, but thought I would send this along anyway.

      https://en.wikipedia.org/wiki/UTF-8

      So you don't have to examine the whole string to determine if the last character is valid (complete).

      Attachments

        Activity

          People

            dleeyh Daniel Lee (Inactive)
            hill David Hill (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.