MDEV-6752 (MariaDB Server)

Trailing incomplete characters are not replaced with question marks on conversion

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5.39, 10.0.13
    • Fix Version/s: 10.0.14
    • Component/s: Character Sets
    • Labels:
      None

      Description

      This script:

      SET NAMES utf8;
      SET @query=CONCAT(_binary"INSERT INTO t1 VALUES('", 0xC2, "\'),('",0xC223,"')");
      SELECT @query;
      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET utf8mb4);
      PREPARE stmt FROM @query;
      EXECUTE stmt;
      SHOW WARNINGS;
      SELECT HEX(a),a FROM t1;

      returns the following output:

      MariaDB [test]> SHOW WARNINGS;
      +---------+------+---------------------------------------------------------+
      | Level   | Code | Message                                                 |
      +---------+------+---------------------------------------------------------+
      | Warning | 1265 | Data truncated for column 'a' at row 1                  |
      | Warning | 1366 | Incorrect string value: '\xC2#' for column 'a' at row 2 |
      +---------+------+---------------------------------------------------------+
      2 rows in set (0.00 sec)
       
      MariaDB [test]> SELECT HEX(a),a FROM t1;
      +--------+------+
      | HEX(a) | a    |
      +--------+------+
      |        |      |
      | 3F23   | ?#   |
      +--------+------+
      2 rows in set (0.00 sec)

      Notice:

      • 0xC2 is an incomplete UTF-8 character (a valid mbhead not followed by an mbtail).
      • 0xC223 is an invalid sequence (a valid mbhead followed by a 7-bit ASCII character instead of an mbtail).
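
      The distinction between the two cases can be illustrated outside the server. The following is a minimal Python sketch, not MariaDB's implementation: the function name classify_utf8 is hypothetical, and it relies on Python's incremental UTF-8 decoder, which buffers a trailing mbhead while waiting for more input but raises an error as soon as an mbhead is followed by a byte that is not an mbtail.

```python
import codecs


def classify_utf8(data: bytes) -> str:
    """Classify a byte string as 'valid', 'incomplete', or 'invalid' UTF-8.

    Hypothetical helper for illustration only; it does not reproduce
    MariaDB's character set conversion code.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: a truncated trailing sequence is buffered, not rejected.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        # An mbhead followed by a non-mbtail byte is rejected immediately.
        return "invalid"
    # getstate() returns (buffered_bytes, flags); leftover bytes mean the
    # input ended in the middle of a multi-byte character.
    pending, _flags = decoder.getstate()
    return "incomplete" if pending else "valid"


print(classify_utf8(b"\xC2"))      # first row of the test case
print(classify_utf8(b"\xC2#"))     # second row of the test case
print(classify_utf8(b"\xC2\xA3"))  # a complete two-byte character
```

      Under these assumptions, 0xC2 is reported as incomplete and 0xC223 as invalid, which matches the two categories described above.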

      Observations:

      • In the second row, the mbhead was correctly replaced with a question mark and '#' was appended to it.
      • In the first row, the mbhead was not replaced with '?'; the value was simply truncated.
      • The warnings are different. The warning for the second row is more descriptive.

      The same effect can be reproduced from a Latin1 terminal window.
      The idea is exactly the same, but the bad sequence is typed directly as Latin1 input (where 'Â' is the byte 0xC2) instead of being built with CONCAT and executed through a prepared statement.

      SET NAMES utf8;
      DROP TABLE IF EXISTS t1;
      CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET latin1);
      INSERT INTO t1 VALUES ('Â'),('Â#');
      SHOW WARNINGS;
      SELECT HEX(a),a FROM t1;

      The column can use any character set other than utf8, so that a conversion takes place.

      The expected behaviour is to replace trailing incomplete characters with question marks,
      so that the first row returns '?' instead of an empty string, with a more descriptive warning similar
      to the one returned for the second row.
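
      The expected result can be modelled with a short Python sketch. This is an illustration under stated assumptions, not the server fix: the error handler name "question_mark" and the function convert_with_question_marks are hypothetical, and Python replaces each ill-formed subsequence as a unit, which may differ from how MariaDB counts replaced bytes in longer truncated sequences.

```python
import codecs


def _question_mark(exc):
    """Decode error handler: emit '?' for the ill-formed bytes and resume."""
    if isinstance(exc, UnicodeDecodeError):
        # Return the replacement text and the position to continue from.
        return ("?", exc.end)
    raise exc


# "question_mark" is a hypothetical handler name chosen for this sketch.
codecs.register_error("question_mark", _question_mark)


def convert_with_question_marks(data: bytes) -> str:
    """Decode UTF-8, turning incomplete and invalid sequences into '?'."""
    return data.decode("utf-8", errors="question_mark")


for raw in (b"\xC2", b"\xC2#"):
    text = convert_with_question_marks(raw)
    # Mirror the HEX(a), a columns of the SELECT in the report.
    print(text.encode("ascii").hex().upper(), text)
```

      In this model both rows are handled the same way: the trailing incomplete 0xC2 becomes '?' (3F), and 0xC223 becomes '?#' (3F23).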

            People

            Assignee:
            bar Alexander Barkov
            Reporter:
            bar Alexander Barkov