[MDEV-27377] New UUID column type values don't round-trip in binary charset mode Created: 2021-12-28  Updated: 2023-11-28

Status: Confirmed
Project: MariaDB Server
Component/s: Data types
Affects Version/s: 10.7.1
Fix Version/s: 10.11

Type: Bug Priority: Major
Reporter: Will Bryant Assignee: Alexander Barkov
Resolution: Unresolved Votes: 0
Labels: uuid
Environment: Ubuntu 20.04

 Description   

The new UUID column type accepts input in multiple formats as documented, but always outputs back in the usual 36-character dashed format. This is great for compatibility IMHO.

However, I'm finding that these values don't round-trip correctly if the client has used `mysql_options(&mysql, MYSQL_SET_CHARSET_NAME, "binary");`, or, more specifically, has used `SET character_set_connection = 'binary';`.

Under this condition I get the error `ERROR 1292 (22007): Incorrect uuid value: 'c26ae0c4-b071-4058-9044-92042d6740fc' for column `ks_test`.`misctbl`.`uuidfield` at row 1`

Although it is possible to work around this by providing the value in the more compact `X'c26ae0c4b0714058904492042d6740fc'` format, I believe this is a bug, because no other column type behaves this way. To my knowledge, all other values round-trip in binary charset mode without conversion (which is why my utility uses it) - whatever the SELECT returns will INSERT again.
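For reference, the packed workaround literal can be produced from the dashed text with a few lines of Python (a client-side sketch; `to_hex_literal` is a hypothetical helper, not part of any API):

```python
import uuid

def to_hex_literal(text: str) -> str:
    """Turn a dashed UUID string into the packed X'...' hex literal
    that INSERTs cleanly under the binary connection charset."""
    return "X'%s'" % uuid.UUID(text).hex

print(to_hex_literal("c26ae0c4-b071-4058-9044-92042d6740fc"))
# → X'c26ae0c4b0714058904492042d6740fc'
```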



 Comments   
Comment by Alexander Barkov [ 2021-12-30 ]

A possible solution would be to make UUID (and INET6) respect character_set_results, like all other string data types do:

  • if character_set_results is NULL or binary, return the internal 16-byte binary representation
  • if character_set_results is something else, return the text representation
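The proposed rule can be sketched in Python (a hypothetical illustration of the server-side decision; `uuid_result` is an assumed name, not actual server code):

```python
import uuid

def uuid_result(internal, character_set_results):
    """Sketch of the proposal: NULL/binary -> raw bytes, anything else -> text."""
    if character_set_results in (None, "binary"):
        return internal                              # internal 16-byte representation
    return str(uuid.UUID(bytes=internal)).encode()   # dashed text representation

u = uuid.UUID("c26ae0c4-b071-4058-9044-92042d6740fc")
assert uuid_result(u.bytes, "binary") == u.bytes
assert uuid_result(u.bytes, "utf8") == b"c26ae0c4-b071-4058-9044-92042d6740fc"
```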
Comment by Will Bryant [ 2021-12-30 ]

That should fix round-tripping, but it does mean that you'll get a different type of output than "normal", which seems odd to me. Maybe I just haven't noticed this behavior before?

With the other string data types, do you get the same output when you have character_set_results NULL or binary as when you have character_set_results matching the database/table/column charset?

Comment by Alexander Barkov [ 2021-12-30 ]

Suppose you have a table like this:

CREATE OR REPLACE TABLE t1 (a VARCHAR(10) CHARACTER SET utf16);

Notice utf16 - it uses two bytes for a single ASCII character.

If you run mysql --column-type-info test and execute a script like this:

SET NAMES utf8;
TRUNCATE TABLE t1;
INSERT INTO t1 VALUES ('aa');
SELECT HEX(a), a FROM t1;

you'll get this output:

Field   1:  `HEX(a)`
...
Field   2:  `a`
Catalog:    `def`
Database:   `test`
Table:      `t1`
Org_table:  `t1`
Type:       VAR_STRING
Collation:  utf8_general_ci (33)
Length:     30
Max_length: 2
Decimals:   0
Flags:      
 
+----------+------+
| HEX(a)   | a    |
+----------+------+
| 00610061 | aa   |
+----------+------+

What happened:

  • On INSERT, the user input 'aa' (two ASCII letters 'a', '6161' in hex) was converted from character_set_client (utf8) to the column character set (utf16), so HEX value '00610061' was stored (4 bytes, two characters: U+0061 followed by another U+0061).
  • On SELECT, the column value '00610061' was converted from the column character set utf16 to character_set_results (utf8), 'aa' was returned (two bytes, hex '6161'), with the column metadata telling that this is utf8.

Notice that you can't even tell that the internal column representation is utf16. It could just as well be utf32 or utf8 or latin1 - the client-side visible data would be the same when running this script.

Now if you run this script:

SET NAMES binary, character_set_results=NULL;
TRUNCATE TABLE t1;
INSERT INTO t1 VALUES ('aa');
SELECT HEX(a), a FROM t1;

it returns:

Field   1:  `HEX(a)`
...
Field   2:  `a`
Catalog:    `def`
Database:   `test`
Table:      `t1`
Org_table:  `t1`
Type:       VAR_STRING
Collation:  utf16_general_ci (54)
Length:     40
Max_length: 2
Decimals:   0
Flags:      
 
+--------+------+
| HEX(a) | a    |
+--------+------+
| 6161   | aa   |
+--------+------+

What happened:

  • On INSERT, the user input was reinterpreted from binary data directly as utf16 (without any conversion), so the user input hex '6161' was stored AS IS to the column: two bytes, one character U+6161 (which is the CJK character 慡).
  • On SELECT, the column value also did not undergo any character set conversion. It was sent directly to the client as hex '6161', together with column metadata saying that this is utf16.
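Both conversions can be verified outside the server with Python's codecs (MySQL's utf16 is big-endian):

```python
# SET NAMES utf8: 'aa' is converted to the utf16 column,
# so the four bytes 00 61 00 61 are stored.
assert "aa".encode("utf-16-be").hex() == "00610061"

# SET NAMES binary: the raw bytes 61 61 are stored as-is; reinterpreted
# as utf16 they form the single CJK character U+6161 (慡).
assert bytes.fromhex("6161").decode("utf-16-be") == "\u6161"
```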

So SET NAMES binary, character_set_results=NULL; in the case of general purpose string types gives you direct binary access to the column.

UUID and INET6 now work the same way on INSERT:

  • In case of SET NAMES utf8 they accept the value in the client text representation (i.e. in the client character set).
  • In case of SET NAMES binary they accept the value in the column internal representation.

On SELECT they now work as follows:

  • In case of character_set_results=utf8 they send the result in the client character set utf8, which is compatible with what general purpose string data types do.
  • But in the case of character_set_results=NULL they work in a different way - they still send the result in the text 7-bit ASCII representation, not in the column internal representation.

If we fix UUID/INET6 to send the internal binary representation if character_set_results is NULL, then they will be consistent with general purpose string data types.

Comment by Will Bryant [ 2021-12-31 ]

Right, so what I'm getting at is that if you had INSERTed under the first case - utf8 -> utf16 conversion - then SELECT under the second case, you will get back utf16. But you will also get utf16 if you use `SET NAMES utf16` instead of `SET NAMES binary`, right?

So although there is a difference when inserting, when selecting, "binary" behaves the same as selecting the "real" character set, for normal strings?

But this would not be the case for the proposed uuid/inet6 behavior.

Comment by Alexander Barkov [ 2021-12-31 ]

Right, so what I'm getting at is that if you had INSERTed under the first case - utf8 -> utf16 conversion - then SELECT under the second case, you will get back utf16.

Correct.

But you will also get utf16 if you use `SET NAMES utf16` instead of `SET NAMES binary`, right?

With `SET NAMES utf16` you'd also get the data in utf16, with metadata saying that this is utf16. However, `SET NAMES utf16` is not supported yet because character_set_client cannot be utf16. The parser can understand only ASCII-based character sets at the moment. You can only do:

SET NAMES binary, character_set_results=utf16;

which means the following behaviour for CHAR/VARCHAR/TEXT:

  • No conversion on INSERT, use the column binary representation (i.e. utf16 for utf16 columns, utf8 for utf8 columns, etc)
  • Conversion to utf16 on SELECT: all string columns are converted from the column character set to utf16, with utf16 reported in the result set metadata. The real column character set is not visible in the metadata.

So although there is a difference when inserting, when selecting, "binary" behaves the same as selecting the "real" character set, for normal strings?

Almost. Suppose you have only utf8 CHAR/VARCHAR/TEXT columns, and run these three scripts:

-- This sends data in utf8 and reports utf8 in the metadata
-- (character set results is NULL, so it sends the column CHARACTER SET in metadata, which is utf8):
SET NAMES binary, character_set_results=NULL;
SELECT utf8col FROM t1;

and

-- This sends data in utf8 and reports "binary" in the metadata
-- (because character_set_results is "binary")
SET NAMES binary;
SELECT utf8col FROM t1;

and

-- This sends data in utf8 and reports "utf8" in the metadata
-- (because character_set_results not NULL and is utf8)
SET NAMES utf8;
SELECT utf8col FROM t1;

Comment by Alexander Barkov [ 2021-12-31 ]

So although there is a difference when inserting, when selecting, "binary" behaves the same as selecting the "real" character set, for normal strings?

But this would not be the case for the proposed uuid/inet6 behavior.

In the proposed behaviour it will work as follows:

  • If character_set_results is either NULL or binary, the underlying binary column representation is returned.
  • If character_set_results is something else, then the text representation is returned, with character set conversion from ASCII to character_set_results.

Some examples:

-- Send the underlying 16 byte binary column representation and send "binary" in the metadata,
-- because character_set_results is NULL and the underlying column representation
-- does not really have a character set - it's an array of bytes (not an array of characters)
SET NAMES binary, character_set_results=NULL;
SELECT uuid_col, inet6_col FROM t1;

-- Send the binary column representation and send "binary" in the metadata
SET NAMES binary;
SELECT uuid_col, inet6_col FROM t1;

-- Send the text representation and "utf8" in the metadata
-- (because character_set_results is utf8)
-- The text representation is kinda converted from ASCII to UTF8.
-- But nothing really is converted because UTF8 is ASCII compatible on 7bit data.
SET NAMES utf8;
SELECT uuid_col, inet6_col FROM t1;

-- Send the text converted to utf16 and "utf16" in the metadata
-- (because character_set_results is utf16)
SET NAMES utf8, character_set_results=utf16;
SELECT uuid_col, inet6_col FROM t1;

Comment by Alexander Barkov [ 2021-12-31 ]

Note, in all cases (no matter which SET NAMES is used), the extended data type information will report the exact data type:

MariaDB [test]> SELECT a FROM t1;
Field   1:  ``
Org_field:  ``
Catalog:    ``
Database:   ``
Table:      ``
Org_table:  ``
Type:       STRING (type=inet6)
Collation:  utf16_general_ci (54)
Length:     156
Max_length: 4
Decimals:   0
Flags:      UNSIGNED BINARY

Notice the extra type=inet6 in the Type metadata column (see MDEV-17832 for details).

Comment by Will Bryant [ 2021-12-31 ]

Yeah, I don't love it - I think I would expect that binary will give the same results as using the actual character set of the database, because it is documented as performing no conversion ("To tell the server to perform no conversion of result sets") in the section about character sets in the original mysql docs (I haven't found a similar mariadb page yet).

Because of that context, I interpreted that to mean only "no character set conversion", not "this will return the internal representation, which is different to the normal query output", as you are proposing here.

Just to understand the issue better, is it hard to parse the text format of UUIDs in binary character set mode? That felt like the simpler solution to me.

Comment by Alexander Barkov [ 2021-12-31 ]

Let's look at the problem from a slightly different angle:

If we think of UUID and INET6 not as data types, but as special character encodings (pseudo character sets):

CREATE TABLE t1 (
  a VARCHAR(64) CHARACTER SET UUID,
  b VARCHAR(64) CHARACTER SET INET6,
  c VARCHAR(64) CHARACTER SET UTF16
);

the above proposal would work exactly the same for UUID/INET6/UTF16 with respect of:

  • INSERT time conversion from the client representation to the column representation
  • SELECT time conversion from the column representation to the client representation
  • SELECT time result set metadata

Now let's say that we want to use the data type notation rather than the pseudo character set notation, so the table will look like this:

CREATE TABLE t1 (
  a UUID,
  b INET6,
  c VARCHAR(64) CHARACTER SET UTF16
);

but let's still use the same character_set_xxx infrastructure to determine how conversion is applied and what's sent in the metadata.

Just to understand the issue better, is it hard to parse the text format of UUIDs in binary character set mode? That felt like the simpler solution to me.

In case of UUID it would probably work, because the two representation sizes are fixed:

  • binary representation is 16 bytes
  • text representation is 36 bytes (the dashed form)
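Since both sizes are fixed, a by-length parser for UUID is trivial to sketch (a hypothetical client-side illustration; `parse_uuid_value` is an assumed name):

```python
import uuid

def parse_uuid_value(raw: bytes) -> uuid.UUID:
    # 16 bytes can only be the internal binary form;
    # the dashed text form is always 36 bytes.
    if len(raw) == 16:
        return uuid.UUID(bytes=raw)
    return uuid.UUID(raw.decode("ascii"))

u = uuid.UUID("c26ae0c4-b071-4058-9044-92042d6740fc")
assert parse_uuid_value(u.bytes) == parse_uuid_value(str(u).encode("ascii"))
```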

But in case of INET6 it won't work:

  • binary representation is 16 bytes
  • the text representation is not fixed: it varies between 2 and 39 bytes.

Some 16-byte strings, e.g. ffff:ffff::ff:ff, are ambiguous:

  • it can be a text representation
  • it can be a binary representation at the same time

To summarize:

  • We'd like to avoid introducing a new system variable unless we absolutely have to.
  • We cannot generally rely on detecting the representation by length (binary vs text).
  • We'd like all data types like UUID and INET6 (with a fixed binary representation and a user-visible text representation), as well as all similar data types we add in the future, to behave as closely as possible to each other (and to the general purpose types CHAR, VARCHAR, TEXT).

The proposed way seems to satisfy all these requirements.

But I agree that it may look confusing when we use character_set_xxx variables to control something which is not literally "character set".

Comment by Alexander Barkov [ 2021-12-31 ]

I think I would expect that binary will give the same results as using the actual character set of the database

"Character set of the database" is only the default value. It's inherited by columns when you execute a CREATE TABLE and don't specify an explicit character set, either per column or per table.

On INSERT, conversion is applied from character_set_client to character_set_connection and then to character set of the column. The database character set does not matter at this point.

On SELECT, conversion is applied between the column character set and character_set_results. The database character set does not matter at this point either.

If something in the documentation says that conversion is applied from the database character set to something else, that is probably a mistake in the documentation and should be fixed. Which article is it?

Comment by Will Bryant [ 2022-01-02 ]

Yeah sorry, I should have said "character set of the column".

If we think of UUID and INET6 not as data types, but as special character encodings (pseudo character sets):

Yeah, I understand where you're coming from with this design and the analogy you're making. But they're not character encodings, and reusing the character set variables to change the output format of these data types - and presumably any other new ones added in the future - seems weird. We don't send the internal binary representation for numeric types, for example, nor do we expect to parse them in binary format.

So why expect to parse these two new types in internal binary format? Just accept them in the same format as all the other character set modes.

Generated at Thu Feb 08 09:52:28 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.