[MDEV-27377] New UUID column type values don't round-trip in binary charset mode Created: 2021-12-28 Updated: 2023-11-28
|
| Status: | Confirmed |
| Project: | MariaDB Server |
| Component/s: | Data types |
| Affects Version/s: | 10.7.1 |
| Fix Version/s: | 10.11 |
| Type: | Bug | Priority: | Major |
| Reporter: | Will Bryant | Assignee: | Alexander Barkov |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | uuid |
| Environment: | Ubuntu 20.04 |
| Description |
The new UUID column type accepts input in multiple formats, as documented, but always outputs in the usual 36-character dashed format. This is great for compatibility IMHO. However, I'm finding that these values don't round-trip correctly if the client has used `mysql_options(&mysql, MYSQL_SET_CHARSET_NAME, "binary");`, or more specifically has used `SET character_set_connection = 'binary';`. Under this condition I get the error:

`ERROR 1292 (22007): Incorrect uuid value: 'c26ae0c4-b071-4058-9044-92042d6740fc' for column `ks_test`.`misctbl`.`uuidfield` at row 1`

Although it is possible to work around this by providing the value in the more packed `X'c26ae0c4b0714058904492042d6740fc'` format, I believe this is a bug because no other column type behaves this way. To my knowledge, all other values round-trip in binary charset mode without conversion (which is why my utility uses it): whatever the SELECT returns will INSERT again.
| Comments |
| Comment by Alexander Barkov [ 2021-12-30 ] |
A possible solution would be to make UUID (and INET6) respect character_set_results, like all other string data types do:
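The illustration that followed this comment was not preserved in this export. A hypothetical sketch of what "respecting character_set_results" could look like (table and column names are made up):

```sql
CREATE TABLE t1 (u UUID);
INSERT INTO t1 VALUES ('c26ae0c4-b071-4058-9044-92042d6740fc');

SET character_set_results = utf8;
SELECT u FROM t1;   -- text form: 'c26ae0c4-b071-4058-9044-92042d6740fc'

SET character_set_results = binary;
SELECT u FROM t1;   -- internal form: the 16 bytes X'c26ae0c4b0714058904492042d6740fc'
```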
| Comment by Will Bryant [ 2021-12-30 ] |
That should fix round-tripping, but it does mean that you'll get a different type of output than "normal", which seems odd to me. Maybe I just haven't noticed this behavior before? With the other string data types, do you get the same output when character_set_results is NULL or binary as when character_set_results matches the database/table/column charset?
| Comment by Alexander Barkov [ 2021-12-30 ] |
Suppose you have a table like this:
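The CREATE TABLE statement from this comment was not preserved; a plausible reconstruction, assuming a single utf16 column (names are made up):

```sql
CREATE TABLE t1
(
  a VARCHAR(10) CHARACTER SET utf16
);
```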
Notice utf16 - it uses two bytes for a single ASCII character. If you run mysql --column-type-info test and execute a script like this:
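The script itself is missing from this export; presumably something along these lines (a reconstruction, assuming the utf16 table is named t1):

```sql
SET NAMES utf8;
INSERT INTO t1 VALUES ('a');
SELECT a FROM t1;
```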
you'll get this output:
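The captured output is gone; with `--column-type-info` it would look roughly like this (illustrative, not verbatim):

```
Field   1:  `a`
Type:       VAR_STRING
Collation:  utf8_general_ci (33)

+------+
| a    |
+------+
| a    |
+------+
```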
What happened: the client sent the literal 'a' in utf8 (character_set_client), the server converted it to utf16 for storage, and on SELECT converted it back to utf8 because that was character_set_results. The metadata therefore reports utf8, not the column's real character set.
Notice, you don't even see that the internal column representation is utf16. It could be utf32 or utf8 or latin1; the client-side visible data would be the same while running this script. Now if you run this script:
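The second script was also not preserved; presumably it switched the connection to binary, roughly (a reconstruction):

```sql
SET NAMES binary;
SET character_set_results = NULL;
SELECT a FROM t1;
```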
it returns:
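The output is also missing; for a utf16 column storing 'a' it would be the raw two-byte encoding, i.e. the bytes 0x00 0x61, something like (illustrative):

```
+--------+
| a      |
+--------+
| 0x0061 |
+--------+
```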
What happened: with character_set_results set to NULL (via SET NAMES binary), the server performed no conversion at all and sent the column's internal utf16 bytes as-is: 0x00 0x61 for 'a'.
So SET NAMES binary, character_set_results=NULL; in case of general purpose string types gives you a direct binary access to the column. UUID and INET6 now work the same way on INSERT:
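The INSERT example is missing; based on the workaround in the description, it presumably showed that in binary mode the column expects the 16-byte internal form (a reconstruction, table name made up):

```sql
SET NAMES binary;
CREATE TABLE t2 (u UUID);

-- the 16-byte internal representation is accepted:
INSERT INTO t2 VALUES (X'c26ae0c4b0714058904492042d6740fc');

-- but the 36-character text form is rejected:
INSERT INTO t2 VALUES ('c26ae0c4-b071-4058-9044-92042d6740fc');
-- ERROR 1292 (22007): Incorrect uuid value: 'c26ae0c4-b071-4058-9044-92042d6740fc' ...
```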
On SELECT they now work as follows:
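The SELECT example is also missing; per the bug report, the value still comes back in the 36-character text form even in binary mode, which is what breaks the round trip (a reconstruction, assuming a UUID column u in table t2):

```sql
SET NAMES binary;
SELECT u FROM t2;
-- returns the text 'c26ae0c4-b071-4058-9044-92042d6740fc',
-- which a binary-mode INSERT would then reject with ER_TRUNCATED_WRONG_VALUE
```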
If we fix UUID/INET6 to send the internal binary representation if character_set_results is NULL, then they will be consistent with general purpose string data types.
| Comment by Will Bryant [ 2021-12-31 ] |
Right, so what I'm getting at is that if you had INSERTed under the first case (utf8 -> utf16 conversion) and then SELECTed under the second case, you would get back utf16. But you would also get utf16 if you used `SET NAMES utf16` instead of `SET NAMES binary`, right? So although there is a difference when inserting, when selecting, "binary" behaves the same as selecting the "real" character set, for normal strings? But this would not be the case for the proposed uuid/inet6 behavior.
| Comment by Alexander Barkov [ 2021-12-31 ] |
Correct.
With `SET NAMES utf16` you'd also get the data in utf16, with metadata saying that this is utf16. However, `SET NAMES utf16` is not supported yet because character_set_client cannot be utf16. The parser can understand only ASCII-based character sets at the moment. You can only do:
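The statement this refers to was not preserved; given that character_set_client cannot be utf16, it was presumably something like:

```sql
SET character_set_client = utf8,
    character_set_connection = utf16,
    character_set_results = utf16;
```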
which means the following behaviour for CHAR/VARCHAR/TEXT: statements are still sent and parsed in an ASCII-compatible character set, but result sets are converted to utf16 before being sent back, with the metadata reporting utf16.
Almost. Suppose you have only utf8 CHAR/VARCHAR/TEXT columns, and run these three scripts:
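The three scripts themselves are missing from this export. A plausible reconstruction of the first (assuming a table t with only utf8 columns, as the comment states):

```sql
CREATE TABLE t (a VARCHAR(10) CHARACTER SET utf8);
INSERT INTO t VALUES ('a');

SET NAMES utf8;
SELECT a FROM t;   -- utf8 bytes out; metadata says utf8
```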
and
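A plausible reconstruction of the second script, presumably the binary case (assuming the same utf8 table t):

```sql
SET NAMES binary;
SELECT a FROM t;   -- the same utf8 bytes, untouched; metadata says binary
```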
and
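And a plausible reconstruction of the third, presumably a character set that forces an actual conversion (assuming the same utf8 table t):

```sql
SET NAMES latin1;
SELECT a FROM t;   -- converted utf8 -> latin1; different bytes for non-ASCII data
```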
| Comment by Alexander Barkov [ 2021-12-31 ] |
In the proposed behaviour it will work as follows: when character_set_results is NULL (or binary), UUID/INET6 columns send their internal 16-byte binary representation; otherwise they send the text representation, converted to character_set_results like any other string.
Some examples:
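The examples were not preserved; under the proposal they would presumably look like this (a reconstruction, names made up):

```sql
CREATE TABLE t2 (u UUID);
INSERT INTO t2 VALUES ('c26ae0c4-b071-4058-9044-92042d6740fc');

SET NAMES utf8;
SELECT u FROM t2;   -- 'c26ae0c4-b071-4058-9044-92042d6740fc' (text form)

SET NAMES binary;   -- character_set_results becomes NULL/binary
SELECT u FROM t2;   -- X'c26ae0c4b0714058904492042d6740fc' (internal 16 bytes)
```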
| Comment by Alexander Barkov [ 2021-12-31 ] |
Note, in all cases (no matter which SET NAMES is used), the extended data type information will report the exact data type:
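The metadata dump is missing; with `mysql --column-type-info` the Type line would carry the extended information, roughly like this for an INET6 column (illustrative, not verbatim):

```
Field   1:  `a`
Type:       STRING; type=inet6
```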
Notice the extra type=inet6 in the Type metadata column (see …).
| Comment by Will Bryant [ 2021-12-31 ] |
Yeah, I don't love it - I think I would expect that binary will give the same results as using the actual character set of the database, because it is documented as performing no conversion ("To tell the server to perform no conversion of result sets") in the section about character sets in the original MySQL docs (I haven't found a similar MariaDB page yet). Because of that context, I interpreted it to mean only "no character set conversion", not "this will return the internal representation, which is different to the normal query output", as you are proposing here.

Just to understand the issue better: is it hard to parse the text format of UUIDs in binary character set mode? That felt like the simpler solution to me.
| Comment by Alexander Barkov [ 2021-12-31 ] |
Let's look at the problem from a slightly different angle: If we think of UUID and INET6 not as data types, but as special character encodings (pseudo character sets):
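The pseudo-character-set table was not preserved; presumably it used a hypothetical notation along these lines (uuid is not a real character set here, it only illustrates the analogy):

```sql
CREATE TABLE t1
(
  a CHAR(36) CHARACTER SET uuid   -- hypothetical: 16-byte internal form, 36-char text form
);
```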
the above proposal would work exactly the same for UUID/INET6/UTF16 with respect of: conversion of values on INSERT, conversion of results according to character_set_results on SELECT, and direct access to the internal representation when character_set_results is NULL.
Now just let's say that we want to use the data type notation rather than the pseduo character set notation, so the table will look like this:
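The table definition was not preserved; in the data type notation it would presumably be:

```sql
CREATE TABLE t1
(
  a UUID
);
```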
but let's still use the same character_set_xxx infrastructure to determine how conversion is applied and what's sent in the metadata.
In case of UUID it would probably work, because the two representation sizes are fixed and distinct: the binary form is always 16 bytes, while the text form is longer (36 characters in the dashed format), so the two kinds of input never collide.
But in case of INET6 it won't work:
Some 16-byte strings, e.g. ffff:ffff::ff:ff, are ambiguous: that string is itself exactly 16 bytes long and is also a valid IPv6 address in text notation, so a 16-byte value could be either a packed binary address or the text form of a completely different address.
To summarize, a solution needs to: let values round-trip under SET NAMES binary, stay consistent with how general purpose string types behave, and parse input unambiguously (which rules out accepting both representations for INET6).

The proposed way seems to satisfy all these requirements. But I agree that it may look confusing when we use character_set_xxx variables to control something which is not literally a "character set".
| Comment by Alexander Barkov [ 2021-12-31 ] |
"Character set of the database" is only the default value. It's inherited by columns when you execute a CREATE TABLE and don't specify an explicit character set, either per column or per table.

On INSERT, conversion is applied from character_set_client to character_set_connection and then to the character set of the column. The database character set does not matter at this point.

On SELECT, conversion is applied between the column character set and character_set_results. The database character set does not matter at this point either.

If something in the documentation says that conversion is applied from the database character set to something else, that is probably a mistake in the documentation and should be fixed. Which article is it?
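The conversion chain described in this comment can be sketched as follows (illustrative, names made up):

```sql
-- the column charset is explicit; the database default no longer matters
CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET utf16);

SET NAMES utf8;
-- INSERT: utf8 (client) -> utf8 (connection) -> utf16 (column)
INSERT INTO t1 VALUES ('a');
-- SELECT: utf16 (column) -> utf8 (character_set_results)
SELECT a FROM t1;
```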
| Comment by Will Bryant [ 2022-01-02 ] |
Yeah sorry, I should have said "character set of the column".
Yeah, I understand where you're coming from with this design, and the analogy you're making. But they're not character encodings, and reusing the character set variables to change the format of these data types (and presumably any other new ones added in the future) seems weird. We don't send the internal binary representation for numeric types, for example, nor do we expect to parse them in binary format. So why expect to parse these two new types in internal binary format? Just accept them in the same format as in all the other character set modes.