[MDEV-6218] Wrong result of CHAR_LENGTH(non-BMP-character) with 3-byte utf8 Created: 2014-05-07  Updated: 2023-11-13

Status: Stalled
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.0.10, 10.2, 10.3
Fix Version/s: 10.5

Type: Bug Priority: Major
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Unresolved Votes: 0
Labels: upstream

Issue Links:
Relates
relates to MDEV-6643 Improve performance of string process... Stalled

 Description   

mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
 
mysql> SELECT hex('��'), char_length('��'), octet_length('��');
+-------------+---------------------+----------------------+
| hex('��')     | char_length('��')     | octet_length('��')     |
+-------------+---------------------+----------------------+
| F09F9881    |                   4 |                    4 |
+-------------+---------------------+----------------------+
1 row in set (0.00 sec)

Notice, I use "SET NAMES utf8" (which is a 3-byte character set
and supports only BMP characters), but then input a 4-byte character.
The result of CHAR_LENGTH() is wrong.

0xF09F9881 is a wrong byte sequence of utf8 (it's correct for utf8mb4 only)

The expected result would be:

  • either return error for the entire query
  • or replace the character to '?' and thus make CHAR_LENGTH() return 1.


 Comments   
Comment by Sergei Petrunia [ 2014-05-07 ]

Repeatable on mysql-5.6.17

Comment by Alexander Barkov [ 2014-06-10 ]

LEFT also returns a wrong result:

MariaDB [test]> SELECT hex(left('��',2));
+---------------------+
| hex(left('��',2))     |
+---------------------+
| F09F                |
+---------------------+
1 row in set (0.00 sec)

Comment by Alexander Barkov [ 2014-07-25 ]

RIGHT returns a wrong result:

MariaDB [test]> SELECT hex(right('��',2));
+----------------------+
| hex(right('��',2))     |
+----------------------+
| 9881                 |
+----------------------+

Comment by Alexander Barkov [ 2014-07-25 ]

SUBSTRING returns a wrong result:

MariaDB [test]> SELECT hex(substring('��',2,1));
+----------------------------+
| hex(substring('��',2,1))     |
+----------------------------+
| 9F                         |
+----------------------------+

Comment by Alexander Barkov [ 2014-07-25 ]

In this example, the returned string is also bad formed:

MariaDB [test]> SELECT '11��2��22222';
+------------------+
| 11��2��22222         |
+------------------+
| 11��2��22222         |
+------------------+

It should probably replace unknown bytes to question marks.

Generated at Thu Feb 08 07:10:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.