[MDEV-15624] Changing the default character set to utf8mb4 changes query evaluation in a very surprising way Created: 2018-03-21  Updated: 2018-04-05  Resolved: 2018-04-04

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 5.5, 10.0, 10.1, 10.2.13, 10.2, 10.3
Fix Version/s: 5.5.60, 10.0.35, 10.1.33, 10.2.15, 10.3.6

Type: Bug Priority: Major
Reporter: Martin Häcker Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: upstream-fixed
Environment:

Darwin crest.fritz.box 17.4.0 Darwin Kernel Version 17.4.0: Sun Dec 17 09:19:54 PST 2017; root:xnu-4570.41.2~1/RELEASE_X86_64 x86_64


Issue Links:
Relates
relates to MDEV-15630 uuid() function evaluates at wrong ti... Closed

 Description   

Changing the default character set to utf8mb4 changes query evaluation in a very surprising way. (Please feel free to set a better title).

This is actually a followup bug report to this: https://bitbucket.org/zzzeek/sqlalchemy/issues/4222/query-yields-different-result-via-sqla (see there for more detail)

Here's the problem: This query: ```select seq, replace(uuid(), "-", "") from seq_0_to_9;``` has a completely different result, wether you connect to mysql with a character set of utf8 or utf8mb4.

Here's an example:

% mysql --default-character-set=utf8mb4 yeepa_test --execute 'select seq, replace(uuid(), "-", "") from seq_0_to_9;'
+-----+----------------------------------+
| seq | replace(uuid(), "-", "")         |
+-----+----------------------------------+
|   0 | 49734e8c2d5111e881616c4008b0d40e |
|   1 | 49734e8c2d5111e881616c4008b0d40e |
|   2 | 49734e8c2d5111e881616c4008b0d40e |
|   3 | 49734e8c2d5111e881616c4008b0d40e |
|   4 | 49734e8c2d5111e881616c4008b0d40e |
|   5 | 49734e8c2d5111e881616c4008b0d40e |
|   6 | 49734e8c2d5111e881616c4008b0d40e |
|   7 | 49734e8c2d5111e881616c4008b0d40e |
|   8 | 49734e8c2d5111e881616c4008b0d40e |
|   9 | 49734e8c2d5111e881616c4008b0d40e |
+-----+----------------------------------+

This returns 10 UUIDs that are all the same. *I believe this to be wrong*

Calling that same query with utf8 as the format yields 10 different uuids - which I believe to be correct:

% mysql --default-character-set=utf8 yeepa_test --execute 'select seq, replace(uuid(), "-", "") from seq_0_to_9;' 
+-----+----------------------------------+
| seq | replace(uuid(), "-", "")         |
+-----+----------------------------------+
|   0 | 797f1cf02d5111e881616c4008b0d40e |
|   1 | 797f1d042d5111e881616c4008b0d40e |
|   2 | 797f1d0e2d5111e881616c4008b0d40e |
|   3 | 797f1d0f2d5111e881616c4008b0d40e |
|   4 | 797f1d182d5111e881616c4008b0d40e |
|   5 | 797f1d222d5111e881616c4008b0d40e |
|   6 | 797f1d232d5111e881616c4008b0d40e |
|   7 | 797f1d2c2d5111e881616c4008b0d40e |
|   8 | 797f1d2d2d5111e881616c4008b0d40e |
|   9 | 797f1d362d5111e881616c4008b0d40e |
+-----+----------------------------------+

This also happens on on at least on current Fedora. (I can look up the details tomorrow)

On a further note, this result is



 Comments   
Comment by Elena Stepanova [ 2018-03-21 ]

Thanks for the report and test case.
Reproducible on all of MariaDB 5.5-10.3 and MySQL 5.5-5.7. Not reproducible on 8.0.

set names utf8mb4;
select seq, replace(uuid(), "-", "") from seq_0_to_9;
explain extended select seq, replace(uuid(), "-", "") from seq_0_to_9;
 
set names utf8;
select seq, replace(uuid(), "-", "") from seq_0_to_9;
explain extended select seq, replace(uuid(), "-", "") from seq_0_to_9;

There is a difference in explain which might be related:

utf8mb4

Note	1003	select `test`.`seq_0_to_9`.`seq` AS `seq`,replace(convert(uuid() using utf8mb4),'-','') AS `replace(uuid(), "-", "")` from `test`.`seq_0_to_9`

utf8

Note	1003	select `test`.`seq_0_to_9`.`seq` AS `seq`,replace(uuid(),'-','') AS `replace(uuid(), "-", "")` from `test`.`seq_0_to_9`

Comment by Alexander Barkov [ 2018-04-03 ]

The problem is also repeatable with a real table instead of a sequence:

SET NAMES utf8mb4;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INT);
INSERT INTO t1 VALUES (1),(2),(3);
SELECT id, replace(uuid(), "-", "") FROM t1;

+------+----------------------------------+
| id   | replace(uuid(), "-", "")         |
+------+----------------------------------+
|    1 | f744162b373211e89127902b3433f757 |
|    2 | f744162b373211e89127902b3433f757 |
|    3 | f744162b373211e89127902b3433f757 |
+------+----------------------------------+

Notice, all values in the second row are the same, which is wrong.

SET NAMES utf8;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INT);
INSERT INTO t1 VALUES (1),(2),(3);
SELECT id, replace(uuid(), "-", "") FROM t1;

+------+----------------------------------+
| id   | replace(uuid(), "-", "")         |
+------+----------------------------------+
|    1 | 0634726b373311e89127902b3433f757 |
|    2 | 063473b5373311e89127902b3433f757 |
|    3 | 06347664373311e89127902b3433f757 |
+------+----------------------------------+

Notice, all values in the second row are different, which is correct.

Comment by Alexander Barkov [ 2018-04-04 ]

The same problem is repeatable with the INSERT function:

SET NAMES utf8mb4;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INT);
INSERT INTO t1 VALUES (1),(2),(3);
SELECT id, INSERT(uuid(), 9, 1, "X") FROM t1;

+------+--------------------------------------+
| id   | INSERT(uuid(), 9, 1, "X")            |
+------+--------------------------------------+
|    1 | 43c2ea3aX37b5-11e8-a1fb-902b3433f757 |
|    2 | 43c2ea3aX37b5-11e8-a1fb-902b3433f757 |
|    3 | 43c2ea3aX37b5-11e8-a1fb-902b3433f757 |
+------+--------------------------------------+

If I change the character set from utf8mb4 to utf8, it works fine.

Comment by Alexander Barkov [ 2018-04-04 ]

The same problem is repeatable with UUID_SHORT() as an argument to any string function which enforces character set conversion of UUID_SHORT() to some other character set (e.g. to ucs2):

SET NAMES utf8;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INT);
INSERT INTO t1 VALUES (1),(2),(3);
SELECT id, REPLACE(uuid_short(), '0', CAST('o' AS CHAR CHARACTER SET ucs2)) AS c FROM t1;

+------+-------------------+
| id   | c                 |
+------+-------------------+
|    1 | 25548518722633758 |
|    2 | 25548518722633758 |
|    3 | 25548518722633758 |
+------+-------------------+

SET NAMES utf8;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INT);
INSERT INTO t1 VALUES (1),(2),(3);
SELECT id, INSERT(uuid_short(), 1, 1, CAST('0' AS CHAR CHARACTER SET ucs2)) AS c FROM t1;

+------+-------------------+
| id   | c                 |
+------+-------------------+
|    1 | 05548518722633759 |
|    2 | 05548518722633759 |
|    3 | 05548518722633759 |
+------+-------------------+

SET NAMES utf8;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INT);
INSERT INTO t1 VALUES (1),(2),(3);
SELECT id, CONCAT(uuid_short(), CAST('0' AS CHAR CHARACTER SET ucs2)) AS c FROM t1;

+------+--------------------+
| id   | c                  |
+------+--------------------+
|    1 | 255485187226337610 |
|    2 | 255485187226337610 |
|    3 | 255485187226337610 |
+------+--------------------+

Comment by Alexander Barkov [ 2018-04-04 ]

The problem happens because Item_func_uuid_short::const_item() and Item_func_uuid::const_item() return true. They should return false.

Comment by Alexander Barkov [ 2018-04-04 ]

The problem is repeatable with DISTINCT. All SELECT queries in the below scripts return one row, while they are expected to return three rows:

SET NAMES utf8mb4;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INT);
INSERT INTO t1 VALUES (1),(2),(3);
SELECT DISTINCT INSERT(uuid(), 9, 1, "X") FROM t1;

SET NAMES utf8;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (id INT);
INSERT INTO t1 VALUES (1),(2),(3);
SELECT DISTINCT REPLACE(uuid_short(), '0', CAST('o' AS CHAR CHARACTER SET ucs2)) AS c FROM t1;
SELECT DISTINCT INSERT(uuid_short(), 1, 1, CAST('0' AS CHAR CHARACTER SET ucs2)) AS c FROM t1;
SELECT DISTINCT CONCAT(uuid_short(), CAST('0' AS CHAR CHARACTER SET ucs2)) AS c FROM t1;

Comment by Martin Häcker [ 2018-04-04 ]

Out of interest, could you link the commit / the commits you fixed this in? I can't seem to find them at https://github.com/MariaDB/server

Comment by Elena Stepanova [ 2018-04-04 ]

dwt, you should be able to see the commit link in the JIRA issue itself, right panel, "Development" section.

https://github.com/mariadb/server/commit/6beb08c7b67ed7610e95c0350f9f93005db1e055

Comment by Martin Häcker [ 2018-04-05 ]

Got it, Thanks!

Comment by Alexander Barkov [ 2018-04-05 ]

You didn't see it because https://github.com/MariaDB/server displays the current branch, with is 10.3.

I pushed the change to 5.5, so it's visible here: https://github.com/MariaDB/server/tree/5.5

Note, the patch has not been propagated to 10.3 yet. It will be, when we merge from 5.5 up to 10.3 next time.

Generated at Thu Feb 08 08:22:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.