Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
10.4(EOL)
-
None
Description
utf16 is not a superset of ucs2, because the two character sets treat high surrogate codes (0xD800..0xDBFF) and low surrogate codes (0xDC00..0xDFFF) differently:
- ucs2 does not treat surrogates in any special way, so a single surrogate can be present in the data
- utf16 uses surrogates to encode non-BMP characters, so a single surrogate cannot appear (surrogates are valid only in pairs)
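The difference between the two rules can be illustrated with a minimal Python sketch. It is not MariaDB code: the function name `find_lone_surrogate` is hypothetical, and the sketch assumes the column bytes are big-endian 16-bit units, as the HEX() output in this report suggests.

```python
def find_lone_surrogate(data: bytes) -> int:
    """Return the byte offset of the first unpaired surrogate in
    big-endian 16-bit data, or -1 if the data is well-formed utf16.

    Hypothetical helper for illustration only, not a MariaDB API.
    """
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    i = 0
    while i < len(data):
        unit = int.from_bytes(data[i:i + 2], "big")
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 4 > len(data):
                return i
            nxt = int.from_bytes(data[i + 2:i + 4], "big")
            if not (0xDC00 <= nxt <= 0xDFFF):
                return i
            i += 4
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate without a preceding high surrogate.
            return i
        else:
            # Ordinary BMP code unit: valid in both ucs2 and utf16.
            i += 2
    return -1
```

Any ucs2 value for which this returns -1 is also valid utf16 with the same bytes. A value such as 0xD800 is accepted by ucs2 but is ill-formed in utf16.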
Non-instant ALTER catches such bad conversion attempts:
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET ucs2, PRIMARY KEY(a)) ENGINE=InnoDB;
INSERT INTO t1 VALUES ('a'),(0xD800);
ALTER TABLE t1 ALGORITHM=COPY, MODIFY a VARCHAR(10) CHARACTER SET utf16;
ERROR 1366 (22007): Incorrect string value: '\xD8\x00' for column `test`.`t1`.`a` at row 2
Instant ALTER does not catch surrogates and alters the table silently, so the table can contain ill-formed data after the ALTER:
ALTER TABLE t1 ALGORITHM=INSTANT, MODIFY a VARCHAR(10) CHARACTER SET utf16;
SELECT HEX(a), OCTET_LENGTH(a), CHAR_LENGTH(a) FROM t1;
+--------+-----------------+----------------+
| HEX(a) | OCTET_LENGTH(a) | CHAR_LENGTH(a) |
+--------+-----------------+----------------+
| 0061   |               2 |              1 |
| D800   |               2 |              0 |
+--------+-----------------+----------------+
Note that in the last row OCTET_LENGTH(a) is greater than 0 while CHAR_LENGTH(a) is 0, which cannot happen with well-formed data.
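One plausible reading of this result is that the character count stops at the first ill-formed sequence. That is an assumption, not something this report confirms. The Python sketch below models that behaviour for big-endian utf16 bytes; `utf16_char_length` is a hypothetical name, not a server function.

```python
def utf16_char_length(data: bytes) -> int:
    """Count utf16 characters in big-endian bytes, stopping at the
    first ill-formed sequence.

    Assumption for illustration: the counter gives up on the first
    unpaired surrogate instead of raising an error.
    """
    count = 0
    i = 0
    while i + 2 <= len(data):
        unit = int.from_bytes(data[i:i + 2], "big")
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: needs a low surrogate right after it.
            if i + 4 > len(data):
                break
            nxt = int.from_bytes(data[i + 2:i + 4], "big")
            if not (0xDC00 <= nxt <= 0xDFFF):
                break
            i += 4  # a surrogate pair is one character in 4 octets
        elif 0xDC00 <= unit <= 0xDFFF:
            break  # unpaired low surrogate
        else:
            i += 2  # BMP character in 2 octets
        count += 1
    return count


# The two rows from the report: (octets, characters)
rows = [bytes.fromhex("0061"), bytes.fromhex("D800")]
lengths = [(len(r), utf16_char_length(r)) for r in rows]
```

Under this assumption the value 0xD800 occupies 2 octets but counts as 0 characters, matching the second row of the output.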
There are two ways to fix this:
- Disallow surrogates in ucs2
- Disallow instant ALTER for ucs2 to utf16
The former is probably preferable, but it may introduce compatibility issues with previous versions.
If we ever disallow surrogates in ucs2, we should probably also disallow them in all other character sets, e.g. utf8, utf8mb4, utf32.
Issue Links
- is caused by
  MDEV-15564 Avoid table rebuild in ALTER TABLE on collation or charset changes (Closed)
- relates to
  MDEV-19285 INSTANT ALTER from ascii_general_ci to latin1_general_ci produces corrupt data (Closed)
  MDEV-28323 Redundant Item_func_conv_charset on WHERE utf8mb4_field=utf8mb3_field (Open)