[MDEV-19284] INSTANT ALTER with ucs2-to-utf16 conversion produces bad data Created: 2019-04-19  Updated: 2022-04-15  Resolved: 2019-05-16

Status: Closed
Project: MariaDB Server
Component/s: Character Sets, Storage Engine - InnoDB
Affects Version/s: 10.4
Fix Version/s: 10.4.5

Type: Bug Priority: Major
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-15564 Avoid table rebuild in ALTER TABLE on... Closed
Relates
relates to MDEV-19285 INSTANT ALTER from ascii_general_ci t... Closed
relates to MDEV-28323 Redundant Item_func_conv_charset on W... Open

 Description   

utf16 is not a super-set for ucs2 because these two character sets treat high surrogate codes (0xD800..0xDBFF) and low surrogate codes (0xDC00..0xDFFF) differently:

  • ucs2 does not treat surrogates in any special way, so a single surrogate can present in data
  • utf16 uses surrogages to encode non-BMP characters, so a single surrogate cannot appear (surrogates can go only in pairs)

Non-instant ALTER catches such bad conversion attempts:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET ucs2, PRIMARY KEY(a)) ENGINE=InnoDB;
INSERT INTO t1 VALUES ('a'),(0xD800);
ALTER TABLE t1 ALGORITHM=COPY, MODIFY a VARCHAR(10) CHARACTER SET utf16;

ERROR 1366 (22007): Incorrect string value: '\xD8\x00' for column `test`.`t1`.`a` at row 2

Instant ALTER does not catch surrogates and alters the table silently, so bad data is possible after ALTER:

ALTER TABLE t1 ALGORITHM=INSTANT, MODIFY a VARCHAR(10) CHARACTER SET utf16;
SELECT HEX(a), OCTET_LENGTH(a), CHAR_LENGTH(a) FROM t1;

+--------+-----------------+----------------+
| HEX(a) | OCTET_LENGTH(a) | CHAR_LENGTH(a) |
+--------+-----------------+----------------+
| 0061   |               2 |              1 |
| D800   |               2 |              0 |
+--------+-----------------+----------------+

Notice, in the last line OCTET_LENGTH(a) is greater than 0, while CHAR_LENGTH(a) is 0, which is not possible normally.

There are two ways to fix this:

  • Disallow surrogates in ucs2
  • Disallow instant ALTER for ucs2 to utf16

The former is probably preferable, but can bring previous version compatibility issues.

If we ever disallow surrogates in ucs2, we should probably also disallow them in all other character sets, e.g. utf8, utf8mb4, utf32.



 Comments   
Comment by Marko Mäkelä [ 2019-04-23 ]

I see one test failure that demonstrates a regression:

CURRENT_TEST: innodb.instant_alter_charset
--- /mariadb/10.4/mysql-test/suite/innodb/r/instant_alter_charset,redundant.result~	2019-04-23 14:52:47.310794160 +0300
+++ /mariadb/10.4/mysql-test/suite/innodb/r/instant_alter_charset,redundant.reject	2019-04-23 14:52:48.586809875 +0300
@@ -267,6 +267,7 @@
 alter table boundary_255
 modify b varchar(200) charset utf8mb3,
 algorithm=instant;
+ERROR 0A000: ALGORITHM=INSTANT is not supported. Reason: Cannot change column type. Try ALGORITHM=COPY
 alter table boundary_255
 modify c varchar(300) charset utf8mb3,
 algorithm=instant;
 
mysqltest: Result length mismatch

For ROW_FORMAT=REDUNDANT, we can and should allow an instantaneous conversion of a VARCHAR column from 50*3 bytes to 200*3 bytes. For other InnoDB ROW_FORMAT, we cannot allow this, because the maximum length is growing from 128‥255 bytes to more than 255 bytes.

It seems that the logic that is present in Field_varstring::is_equal() is not being correctly applied in all cases. Note: table->file->ha_table_flags() & HA_EXTENDED_TYPES_CONVERSION will distinguish ROW_FORMAT=REDUNDANT.

On a related note (see MDEV-18584), for CHAR where mbminlen<mbmaxlen, InnoDB would internally use a variable-length encoding of n*mbminlenn*mbmaxlen bytes except when ROW_FORMAT=REDUNDANT. Such CHAR columns can allow instantaneous changes, say, from utf8mb3 to utf8mb4, provided that the n*mbmaxlen is not growing from 128‥255 bytes to more than 255 bytes.

Generated at Thu Feb 08 08:50:27 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.