[MDEV-25041] CAST produces warnings and different results for formally invalid Unicode, while MySQL's accepts them Created: 2021-03-03  Updated: 2022-03-25  Resolved: 2021-07-26

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.4.13, 10.4.18, 10.5.6, 10.5.10
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Valerii Kravchuk Assignee: Sergei Golubchik
Resolution: Not a Bug Votes: 1
Labels: None


 Description   

MariaDB's implementation of CAST is different (more "strict") than MySQL's, and this creates additional compatibility problems.

Consider two examples, MySQL 8.0.23 first:

openxs@ao756:~/dbs/8.0$ bin/mysql -uroot --socket=/tmp/mysql8.sock
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 10
Server version: 8.0.23 Source distribution
 
Copyright (c) 2000, 2021, Oracle and/or its affiliates.
 
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
mysql> select cast(unhex('FF') as char(1) charset utf8mb4);
+----------------------------------------------+
| cast(unhex('FF') as char(1) charset utf8mb4) |
+----------------------------------------------+
| ▒                                             |
+----------------------------------------------+
1 row in set (0.00 sec)
 
mysql> select hex(cast(unhex('FF') as char(1) charset utf8mb4));
+---------------------------------------------------+
| hex(cast(unhex('FF') as char(1) charset utf8mb4)) |
+---------------------------------------------------+
| FF                                                |
+---------------------------------------------------+
1 row in set (0.00 sec)
 
mysql> select CAST(UNHEX('D800AC01') AS CHAR(3072) CHARACTER SET utf16);
+-----------------------------------------------------------+
| CAST(UNHEX('D800AC01') AS CHAR(3072) CHARACTER SET utf16) |
+-----------------------------------------------------------+
| ?¬                                                        |
+-----------------------------------------------------------+
1 row in set (0.00 sec)
 
mysql> select hex(CAST(UNHEX('D800AC01') AS CHAR(3072) CHARACTER SET utf16));
+----------------------------------------------------------------+
| hex(CAST(UNHEX('D800AC01') AS CHAR(3072) CHARACTER SET utf16)) |
+----------------------------------------------------------------+
| D800AC01                                                       |
+----------------------------------------------------------------+
1 row in set (0.00 sec)

In the former we see invalid UTF8 byte value, 0xFF, accepted. In the latter we see "unpaired surrogate" bytes in UTF16, both accepted silently. Now let's consider the same statements in MariaDB:

openxs@ao756:~/dbs/maria10.5$ bin/mysql --socket=/tmp/mariadb.sock
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.5.10-MariaDB MariaDB Server
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> select cast(unhex('FF') as char(1) charset utf8mb4);
+----------------------------------------------+
| cast(unhex('FF') as char(1) charset utf8mb4) |
+----------------------------------------------+
| ?                                            |
+----------------------------------------------+
1 row in set, 1 warning (0,034 sec)
 
MariaDB [(none)]> show warnings\G
*************************** 1. row ***************************
  Level: Warning
   Code: 1300
Message: Invalid utf8mb4 character string: '\xFF'
1 row in set (0,000 sec)
 
MariaDB [(none)]> select CAST(UNHEX('D800AC01') AS CHAR(3072) CHARACTER SET utf16);
+-----------------------------------------------------------+
| CAST(UNHEX('D800AC01') AS CHAR(3072) CHARACTER SET utf16) |
+-----------------------------------------------------------+
| ?¬?                                                       |
+-----------------------------------------------------------+
1 row in set, 1 warning (0,000 sec)
 
MariaDB [(none)]> show warnings\G
*************************** 1. row ***************************
  Level: Warning
   Code: 1300
Message: Invalid utf16 character string: '\xD8\x00\xAC\x01'
1 row in set (0,000 sec)
 
MariaDB [(none)]> select hex(CAST(UNHEX('D800AC01') AS CHAR(3072) CHARACTER SET utf16));
+----------------------------------------------------------------+
| hex(CAST(UNHEX('D800AC01') AS CHAR(3072) CHARACTER SET utf16)) |
+----------------------------------------------------------------+
| 003F00AC003F                                                   |
+----------------------------------------------------------------+
1 row in set, 1 warning (0,000 sec)
 
MariaDB [(none)]> show warnings\G                                               *************************** 1. row ***************************
  Level: Warning
   Code: 1300
Message: Invalid utf16 character string: '\xD8\x00\xAC\x01'
1 row in set (0,000 sec)

We see warnings that in DML or DDL case will be errors in strict SQL mode. Moreover, in the latter case with warning a different hex value is produced.

IMHO this is a bug, but if intended, I'd ask for some mode/setting that would make CAST work the same way as in MySQL, for compatibility of existing applications.



 Comments   
Comment by Brad Chapman [ 2021-07-16 ]

I have confirmed this is occurring in our environment on MariaDB 10.3.25 and 10.4.15.

The application contains a table with a column "certificate_data" having data type "mediumblob," which contains binary data ranging from a few hundred to several thousand characters.

The application was trying to create a new VIRTUAL column with an expression containing a CAST statement, similar to the example posted above by OP. The new column "cert_blob_lookup_hash" would have a 16-byte binary hash to be used as an index.

ALTER TABLE computer__installed_certificates ADD `cert_blob_lookup_hash` binary(16) GENERATED ALWAYS AS (UNHEX(md5(CAST(certificate_data AS CHAR(3072) CHARACTER SET utf16)))) VIRTUAL, ADD INDEX `cert_blob_lookup_hash` (`cert_blob_lookup_hash`)

This is the error that was returned:

/* SQL Error (1300): Invalid utf8 character string: '\x82\x08\x1B0\x82\x06\x03\xA0\x03\x02\x01\x02\x02\x13]\x00\x0...' */

Removing the "CHARACTER SET utf16" portion also returns an error: "Invalid utf8mb4 character string." This reflects the session value for character sets.

The documentation for the [CHAR](https://mariadb.com/kb/en/char/) function claims to support a range of 0-255; however, setting CHAR(255) or any other value in that range still produces an invalid utf string error.

Changing the function from CHAR(3072) to BINARY(3072) produces a warning, instead of an error:

/* Affected rows: 0  Found rows: 0  Warnings: 12  Duration for 1 query: 2.234 sec. */
Warning: Code 1292 
Truncated incorrect BINARY(3072) value: '0\x82\x0D=0\x82\x0C%\xA0\x03\x02\x01\x02\x02\x10*\xC2\xD0?\x14i\x92}t\xAA\xFC\xDE\xD1\x1Db\xA10\x0D\x06\x09*\x86H\x86\xF7\x0D...'

Comment by Sergei Golubchik [ 2021-07-22 ]

What is a bug here? The database correctly rejects invalid data.

Your certificate_data is mediumblob, right? If it contains invalid utf8 data, you'll get an error when trying to cast it to utf16 (because when you cast to utf16 the database should guarantee that the result is a valid utf16 string). If it's longer than 3072 bytes, you'll get a warning casting it to BINARY(3072).

Why do you even cast your mediumblob to utf16 or binary(3072) ?

Comment by Brad Chapman [ 2021-07-22 ]

I’ll have to ask the vendor who wrote this transform. I’m just the application owner at my company. What makes it valid or invalid?

Comment by Brad Chapman [ 2021-07-22 ]

Here is a note I found from the vendor from 6 months ago when this issue first appeared:

What we can share is that we are modifying the computer_installed_certificates by adding a column cert_blob_lookup_hash of type binary(16) and adding non-unique index cert_blob_lookup_hash using column cert_blob_lookup_hash. My guess is that MariaDB does not support this, or the command used to perform this.

This appears to be more of a MariaDB issue, and who it is not able to accept the character string "Invalid utf8mb4 character string". We recommend testing with a supported MySQL version, or reaching out to MariaDB support.

There are some instances online where disabling strict mode will change the ERROR to a WARNING, and that might allow the upgrade to continue. With that said, since we do not support MariaDB we might have unexpected consequences in doing so.

Our options are test with a supported MySQL version, or as noted remove Strict Mode which we verified is enabled in the summary "sql_mode STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION". Since we do not support MariaDB, we will not know what affect that can have later on after it is disabled.

Comment by Sergei Golubchik [ 2021-07-22 ]

I only looked at the first example from the bug report, with cast(unhex('FF') as char(1) charset utf8mb4).

According to Wikipedia, 0xFF is not a valid UTF-8 character. When you're trying to put invalid data into the database, database doesn't allow you to. This is what databases do.

Now, what you (your vendor, I mean) is doing makes no sense to me. There is a mediumblob column with binary data. You want to add a virtual column with a hash over this mediumblob. This could be done as

ALTER TABLE computer__installed_certificates ADD `cert_blob_lookup_hash` binary(16) GENERATED ALWAYS AS (UNHEX(md5(certificate_data))) VIRTUAL, ADD INDEX `cert_blob_lookup_hash` (`cert_blob_lookup_hash`)

Casting the mediumblob to utf16 guarantees that the result will be a valid utf16 string or an error. And this is exactly what you get, an error. If you don't need utf16 validation, you don't need to cast. The cast doesn't do anything besides utf16 validation (and truncation, see below).

Casting the mediumblob to binary(3072) cases too long blobs to be truncated with a warning. And that's what you get, a warning. If you don't need a warning — don't cast. If you need md5 to apply only to the first 3072 bytes, you can use left(certificate_data, 3072) instead of a cast.

Comment by Brad Chapman [ 2021-07-22 ]

Thanks for the comments. I sent a note to the vendor to read your notes. I'm also attempting to elicit their reason for manipulating the data, and if it's referenced anywhere else in the database.

Do you think there is a way to get MariaDB to 'emulate' MySQL's less strict handling without throwing an error? Even with 'STRICT_TRANS_TABLES' turned off, the virtual column returns different results when processed by MariaDB vs MySQL.

Comment by Sergei Golubchik [ 2021-07-25 ]

I don't think it's very likely. MariaDB never tried to be bug-to-bug compatible with MySQL, so if MySQL has a bug, we generally don't feel like we need to introduce the same bug in MariaDB. And storing invalid data in a database is a bug. It might get fixed in MySQL too eventually.

Comment by sbester1 [ 2022-02-10 ]

In official MySQL 8.0.28 the fix for Bug #33199145 improved the scenario, so invalid casted strings are more likely to be rejected than blindly accepted.

Generated at Thu Feb 08 09:34:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.