[MDEV-25041] CAST produces warnings and different results for formally invalid Unicode, while MySQL's accepts them Created: 2021-03-03 Updated: 2022-03-25 Resolved: 2021-07-26 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Character Sets |
| Affects Version/s: | 10.4.13, 10.4.18, 10.5.6, 10.5.10 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Valerii Kravchuk | Assignee: | Sergei Golubchik |
| Resolution: | Not a Bug | Votes: | 1 |
| Labels: | None | ||
| Description |
|
MariaDB's implementation of CAST is different (more "strict") than MySQL's, and this creates additional compatibility problems. Consider two examples, MySQL 8.0.23 first:
In the former we see invalid UTF8 byte value, 0xFF, accepted. In the latter we see "unpaired surrogate" bytes in UTF16, both accepted silently. Now let's consider the same statements in MariaDB:
We see warnings that in DML or DDL case will be errors in strict SQL mode. Moreover, in the latter case with warning a different hex value is produced. IMHO this is a bug, but if intended, I'd ask for some mode/setting that would make CAST work the same way as in MySQL, for compatibility of existing applications. |
| Comments |
| Comment by Brad Chapman [ 2021-07-16 ] | |||||
|
I have confirmed this is occurring in our environment on MariaDB 10.3.25 and 10.4.15. The application contains a table with a column "certificate_data" having data type "mediumblob," which contains binary data ranging from a few hundred to several thousand characters. The application was trying to create a new VIRTUAL column with an expression containing a CAST statement, similar to the example posted above by OP. The new column "cert_blob_lookup_hash" would have a 16-byte binary hash to be used as an index.
This is the error that was returned:
Removing the "CHARACTER SET utf16" portion also returns an error: "Invalid utf8mb4 character string." This reflects the session value for character sets. The documentation for the [CHAR](https://mariadb.com/kb/en/char/) function claims to support a range of 0-255; however, setting CHAR(255) or any other value in that range still produces an invalid utf string error. Changing the function from CHAR(3072) to BINARY(3072) produces a warning, instead of an error:
| |||||
| Comment by Sergei Golubchik [ 2021-07-22 ] | |||||
|
What is a bug here? The database correctly rejects invalid data. Your certificate_data is mediumblob, right? If it contains invalid utf8 data, you'll get an error when trying to cast it to utf16 (because when you cast to utf16 the database should guarantee that the result is a valid utf16 string). If it's longer than 3072 bytes, you'll get a warning casting it to BINARY(3072). Why do you even cast your mediumblob to utf16 or binary(3072) ? | |||||
| Comment by Brad Chapman [ 2021-07-22 ] | |||||
|
I’ll have to ask the vendor who wrote this transform. I’m just the application owner at my company. What makes it valid or invalid? | |||||
| Comment by Brad Chapman [ 2021-07-22 ] | |||||
|
Here is a note I found from the vendor from 6 months ago when this issue first appeared:
| |||||
| Comment by Sergei Golubchik [ 2021-07-22 ] | |||||
|
I only looked at the first example from the bug report, with cast(unhex('FF') as char(1) charset utf8mb4). According to Wikipedia, 0xFF is not a valid UTF-8 character. When you're trying to put invalid data into the database, database doesn't allow you to. This is what databases do. Now, what you (your vendor, I mean) is doing makes no sense to me. There is a mediumblob column with binary data. You want to add a virtual column with a hash over this mediumblob. This could be done as
Casting the mediumblob to utf16 guarantees that the result will be a valid utf16 string or an error. And this is exactly what you get, an error. If you don't need utf16 validation, you don't need to cast. The cast doesn't do anything besides utf16 validation (and truncation, see below). Casting the mediumblob to binary(3072) cases too long blobs to be truncated with a warning. And that's what you get, a warning. If you don't need a warning — don't cast. If you need md5 to apply only to the first 3072 bytes, you can use left(certificate_data, 3072) instead of a cast. | |||||
| Comment by Brad Chapman [ 2021-07-22 ] | |||||
|
Thanks for the comments. I sent a note to the vendor to read your notes. I'm also attempting to elicit their reason for manipulating the data, and if it's referenced anywhere else in the database. Do you think there is a way to get MariaDB to 'emulate' MySQL's less strict handling without throwing an error? Even with 'STRICT_TRANS_TABLES' turned off, the virtual column returns different results when processed by MariaDB vs MySQL. | |||||
| Comment by Sergei Golubchik [ 2021-07-25 ] | |||||
|
I don't think it's very likely. MariaDB never tried to be bug-to-bug compatible with MySQL, so if MySQL has a bug, we generally don't feel like we need to introduce the same bug in MariaDB. And storing invalid data in a database is a bug. It might get fixed in MySQL too eventually. | |||||
| Comment by sbester1 [ 2022-02-10 ] | |||||
|
In official MySQL 8.0.28 the fix for Bug #33199145 improved the scenario, so invalid casted strings are more likely to be rejected than blindly accepted. |