[MDEV-15564] Avoid table rebuild in ALTER TABLE on collation or charset changes - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Fix Version/s: 10.4.3
Component/s: Storage Engine - InnoDB
Labels:
- instant
- online-ddl

Description

Some character encodings are proper subsets of others. ASCII is a proper subset of latin1 or latin2 or UTF-8. The 3-byte UTF-8 is a proper subset of the 4-byte UTF-8.

When the character encoding of a column changes such that the storage encoding does not change, we should avoid rebuilding the table.

Likewise, when the collation of a column changes such that the encoding stays the same, we should avoid rebuilding the table.

Collations do not matter if the column is not indexed.

If the collation of an indexed column changes, then the affected indexes may have to be rebuilt. For certain collations we might avoid that as well. For example, changing the encoding and collation from binary 3-byte to binary 4-byte UTF-8 might not require any change.

When the collation change involves a column that is part of the PRIMARY KEY and we have determined that a collation change is necessary, then the whole table will have to be rebuilt.

Attachments

Issue Links

blocks

MDEV-11424 Instant ALTER TABLE of failure-free record format changes

Closed

causes

MDEV-18584 Avoid copying when altering CHAR column in InnoDB table

Confirmed

MDEV-19284 INSTANT ALTER with ucs2-to-utf16 conversion produces bad data

Closed

MDEV-19285 INSTANT ALTER from ascii_general_ci to latin1_general_ci produces corrupt data

Closed

MDEV-19524 Server crashes in Bitmap<64u>::is_clear_all / Field_longstr::csinfo_change_allows_instant_alter

Closed

MDEV-20565 Assertion failure on CHANGE COLUMN…SYSTEM VERSIONING

Closed

MDEV-22333 Assertion `len <= fixed_len' failed in rec_get_converted_size_comp_prefix_low on ALTER + INSERT

Confirmed

MDEV-22334 Assertion `col->len == len' failed in innobase_rename_or_enlarge_column_try on ALTER

Closed

MDEV-27280 server crashes on CHECK TABLE after COLLATE change for utf8mb4

Closed

is blocked by

MDEV-15563 Instant failure-free data type conversions

Closed

MDEV-17965 Allow instant VARCHAR increase of indexed fields

Closed

relates to

MDEV-17773 Avoid table rebuild in ALTER TABLE on collation or charset changes for ENUM and SET

Open

MDEV-27859 Instant change of ENUM is refused because of COLLATE mismatch

Confirmed

MDEV-17301 Change of COLLATE unnecessarily requires ALGORITHM=COPY

Closed

MDEV-26294 Duplicate entries in unique index not detected when changing collation with INPLACE / NOCOPY algoritm

Closed

MDEV-27864 Alter table modify column for same data type copying table

Closed

MDEV-28323 Redundant Item_func_conv_charset on WHERE utf8mb4_field=utf8mb3_field

Open

(4 causes, 2 is blocked by, 6 relates to)

Activity

Ascending order - Click to sort in descending order

Eugene Kosov (Inactive) added a comment - 2018-09-05 12:41 - edited

Not only SYS_COLUMNS.PRTYPE changes on charset/collation change but also SYS_COLUMNS.LEN could increase f.ex. on ascii -> utf8mb3 or utf8mb3 -> utf8mb4.

Eugene Kosov (Inactive) added a comment - 2018-09-05 12:41 - edited Not only SYS_COLUMNS.PRTYPE changes on charset/collation change but also SYS_COLUMNS.LEN could increase f.ex. on ascii -> utf8mb3 or utf8mb3 -> utf8mb4.

Marko Mäkelä added a comment - 2018-09-13 12:48

Right. 10.2 already supports instant changes of the maximum length of VARCHAR (in bytes) except when the old length was less than 256 bytes and the new length is at least 256 bytes. Other changes to wider type would be covered by ~~MDEV-15563~~ (ROW_FORMAT=REDUNDANT) and ~~MDEV-11424~~ (ROW_FORMAT=DYNAMIC, ROW_FORMAT=COMPACT).

As part of this task, we can keep the existing limitation regarding table rebuilds, but still support ALGORITHM=INPLACE, LOCK=NONE instead of the currently only choice ALGORITHM=COPY.

Marko Mäkelä added a comment - 2018-09-13 12:48 Right. 10.2 already supports instant changes of the maximum length of VARCHAR (in bytes) except when the old length was less than 256 bytes and the new length is at least 256 bytes. Other changes to wider type would be covered by MDEV-15563 ( ROW_FORMAT=REDUNDANT ) and MDEV-11424 ( ROW_FORMAT=DYNAMIC , ROW_FORMAT=COMPACT ). As part of this task, we can keep the existing limitation regarding table rebuilds, but still support ALGORITHM=INPLACE, LOCK=NONE instead of the currently only choice ALGORITHM=COPY .

Eugene Kosov (Inactive) added a comment - 2018-09-13 17:23

Current open questions:
1) Can CHAR(200) be converted from utf8mb3 to utf8mb4? It's size should increase from 600 to 800 in that case. If InnoDB really allocates 600 bytes it can't increase it to 800 bytes instantly. This should be examined in source code.
2) There are a lot of charsets in MariaDB and are much more charset/collation pairs. I do not know the all charset/collate -> charset/collate pair which could be altered instantly. E.g. I don't know whether it's possible to instantly convert ascii_general_nopad_ci to latin2_general_nopad_ci.
3) I don't know how to test every combination of charset/collation convert or not convert to other charset/collation instantly. Actually I don't think it's ever needed. I'm trying to write the most generic check code and test every corner case, instead of every possible case.
4) SYS_COLUMNS.PRTYPE and SYS_COLUMNS.LEN are updated on my instant queries. And thus InnoDB cache data structures should be checked somehow. And I don't know a simple way to do it. A complex way could look like ALTER ... INSERT ... SELECT ... ORDER BY.

Eugene Kosov (Inactive) added a comment - 2018-09-13 17:23 Current open questions: 1) Can CHAR(200) be converted from utf8mb3 to utf8mb4 ? It's size should increase from 600 to 800 in that case. If InnoDB really allocates 600 bytes it can't increase it to 800 bytes instantly. This should be examined in source code. 2) There are a lot of charsets in MariaDB and are much more charset/collation pairs. I do not know the all charset/collate -> charset/collate pair which could be altered instantly. E.g. I don't know whether it's possible to instantly convert ascii_general_nopad_ci to latin2_general_nopad_ci . 3) I don't know how to test every combination of charset/collation convert or not convert to other charset/collation instantly. Actually I don't think it's ever needed. I'm trying to write the most generic check code and test every corner case, instead of every possible case. 4) SYS_COLUMNS.PRTYPE and SYS_COLUMNS.LEN are updated on my instant queries. And thus InnoDB cache data structures should be checked somehow. And I don't know a simple way to do it. A complex way could look like ALTER ... INSERT ... SELECT ... ORDER BY .

Marko Mäkelä added a comment - 2018-09-13 19:20

First of all, ROW_FORMAT=REDUNDANT basically treats every field as if it is variable-length and possibly NULL. I think that we can allow any size extension with that format. For other formats (COMPACT, DYNAMIC, COMPRESSED) the following holds:

1) Internally, long CHAR (maybe the limit is 255 bytes) is treated as VARCHAR, and I think that it should be possible to instantly extend this. As an optimization (which is not available in ROW_FORMAT=REDUNDANT) for variable-length character sets, CHAR will be stored as n*mbminlen‥n*mbmaxlen bytes, and trailing space will be trimmed. Only if the ‘payload’ (not counting trailing space) exceeds n*mbminlen bytes, more storage will be allocated. This ought to imply that such CHAR columns are stored as variable-length (explicit length is encoded). I implemented this in MySQL 5.0.3, but cannot remember the details right now.
2), 3) I think that it is OK to start with some safe subset. bar should be able to provide a list.
4) Yes, we should update the dict_table_t. Maybe you could take a look at the code that implements the instant VARCHAR extension, which was introduced in MySQL 5.7.

Marko Mäkelä added a comment - 2018-09-13 19:20 First of all, ROW_FORMAT=REDUNDANT basically treats every field as if it is variable-length and possibly NULL . I think that we can allow any size extension with that format. For other formats ( COMPACT , DYNAMIC , COMPRESSED ) the following holds: 1) Internally, long CHAR (maybe the limit is 255 bytes) is treated as VARCHAR, and I think that it should be possible to instantly extend this. As an optimization (which is not available in ROW_FORMAT=REDUNDANT ) for variable-length character sets, CHAR will be stored as n*mbminlen ‥ n*mbmaxlen bytes, and trailing space will be trimmed. Only if the ‘payload’ (not counting trailing space) exceeds n*mbminlen bytes, more storage will be allocated. This ought to imply that such CHAR columns are stored as variable-length (explicit length is encoded). I implemented this in MySQL 5.0.3, but cannot remember the details right now. 2), 3) I think that it is OK to start with some safe subset. bar should be able to provide a list. 4) Yes, we should update the dict_table_t . Maybe you could take a look at the code that implements the instant VARCHAR extension, which was introduced in MySQL 5.7.

Eugene Kosov (Inactive) added a comment - 2018-09-17 12:01

VARCHAR(200) + ROW_FORMAT=REDUNDANT + UTF8MB3:

Russian 'фыва' occupied 8 bytes, not 4*3=12 bytes.

VARCHAR(200) + ROW_FORMAT=COMPACT + UTF8MB3:

Russian 'фыва' occupied 8 bytes.

VARCHAR(200) + ROW_FORMAT=COMPRESSED + UTF8MB3:

Russian 'фыва' occupied 8 bytes.

VARCHAR(200) + ROW_FORMAT=DYNAMIC + UTF8MB3:

Russian 'фыва' occupied 8 bytes.

CHAR(200) + ROW_FORMAT=REDUNDANT + UTF8MB3:

Russian 'фыва' occupied 600 bytes.

CHAR(200) + ROW_FORMAT=COMPACT + UTF8MB3:

Russian 'фыва' occupied 200 bytes.

CHAR(200) + ROW_FORMAT=COMPRESSED + UTF8MB3:

Russian 'фыва' occupied 200 bytes.

CHAR(200) + ROW_FORMAT=DYNAMIC + UTF8MB3:

Russian 'фыва' occupied 200 bytes.

So, CHAR + ROW_FORMAT=REDUNDANT may need size increase on charset change. Disabling it for simplicity.

Eugene Kosov (Inactive) added a comment - 2018-09-17 12:01 VARCHAR(200) + ROW_FORMAT=REDUNDANT + UTF8MB3: Russian 'фыва' occupied 8 bytes, not 4*3=12 bytes. VARCHAR(200) + ROW_FORMAT=COMPACT + UTF8MB3: Russian 'фыва' occupied 8 bytes. VARCHAR(200) + ROW_FORMAT=COMPRESSED + UTF8MB3: Russian 'фыва' occupied 8 bytes. VARCHAR(200) + ROW_FORMAT=DYNAMIC + UTF8MB3: Russian 'фыва' occupied 8 bytes. CHAR(200) + ROW_FORMAT=REDUNDANT + UTF8MB3: Russian 'фыва' occupied 600 bytes. CHAR(200) + ROW_FORMAT=COMPACT + UTF8MB3: Russian 'фыва' occupied 200 bytes. CHAR(200) + ROW_FORMAT=COMPRESSED + UTF8MB3: Russian 'фыва' occupied 200 bytes. CHAR(200) + ROW_FORMAT=DYNAMIC + UTF8MB3: Russian 'фыва' occupied 200 bytes. So, CHAR + ROW_FORMAT=REDUNDANT may need size increase on charset change. Disabling it for simplicity.

Marko Mäkelä added a comment - 2018-09-18 17:53

Correct: ROW_FORMAT=REDUNDANT will reserve n*mbmaxlen bytes for {{CHAR(n)}}. It will do that even when the column is NULL. The idea is to allow update-in-place. The later ROW_FORMAT optimize the space and allocate n*mbminlen to n*mbmaxlen bytes. Trailing space is ‘compressed away’. For example, 'фыва' is 4×2 bytes. If the column were CHAR(8), we would allocate only 8 bytes for it, because the trailing space 'фыва ' does not need to be stored. If there was one more letter before the trailing space, we would have to allocate more space.

It could be that {{CHAR(n)}} with a variable-length character encoding is actually internally stored as VARCHAR, that is, an explicit length is stored. I cannot remember it.

For ROW_FORMAT=REDUNDANT, I think that it should be OK to bend the existing rules and allow an instant extension of the maximum length of the column, that is, allow CHAR(200) to be converted to a bigger mbmaxlen without changing the existing data. This would be a step towards ~~MDEV-15563~~, which would allow CHAR or VARCHAR to be instantaneously extended arbitrarily in ROW_FORMAT=REDUNDANT.

Marko Mäkelä added a comment - 2018-09-18 17:53 Correct: ROW_FORMAT=REDUNDANT will reserve n*mbmaxlen bytes for {{CHAR(n)}}. It will do that even when the column is NULL . The idea is to allow update-in-place. The later ROW_FORMAT optimize the space and allocate n*mbminlen to n*mbmaxlen bytes. Trailing space is ‘compressed away’. For example, 'фыва' is 4×2 bytes. If the column were CHAR(8) , we would allocate only 8 bytes for it, because the trailing space 'фыва ' does not need to be stored. If there was one more letter before the trailing space, we would have to allocate more space. It could be that {{CHAR(n)}} with a variable-length character encoding is actually internally stored as VARCHAR , that is, an explicit length is stored. I cannot remember it. For ROW_FORMAT=REDUNDANT , I think that it should be OK to bend the existing rules and allow an instant extension of the maximum length of the column, that is, allow CHAR(200) to be converted to a bigger mbmaxlen without changing the existing data. This would be a step towards MDEV-15563 , which would allow CHAR or VARCHAR to be instantaneously extended arbitrarily in ROW_FORMAT=REDUNDANT .

Marko Mäkelä added a comment - 2018-11-09 10:27

Thanks, this looks generally good. I am mainly requesting some code refactoring.

Please document (with examples) in which way the file format is being effectively changed, as witnessed by the change to the CHECK TABLE function btr_index_rec_validate(). We must prevent older MariaDB versions from accessing tables where the file format has been changed.

Marko Mäkelä added a comment - 2018-11-09 10:27 Thanks, this looks generally good. I am mainly requesting some code refactoring. Please document (with examples) in which way the file format is being effectively changed, as witnessed by the change to the CHECK TABLE function btr_index_rec_validate() . We must prevent older MariaDB versions from accessing tables where the file format has been changed.

Marko Mäkelä added a comment - 2019-01-16 07:51

Some necessary code changes are common between this and ~~MDEV-15563~~. I think that ~~MDEV-15563~~ is closer to completion at this point, and should be submitted first.

Marko Mäkelä added a comment - 2019-01-16 07:51 Some necessary code changes are common between this and MDEV-15563 . I think that MDEV-15563 is closer to completion at this point, and should be submitted first.

Alexander Barkov added a comment - 2019-02-11 12:53 - edited

I've taken a look into a number of commits:

@kevgs pushed 6 commits.

20cb37e add GBK
053e48e add UJIS
4e3f6ef add SJIS
54e4cc0 add BIG5
90b4b4f add latin2
3237b5e added latin7

There is no need to list all ASCII compatible character set names and check them using strcmp().
Please check:

(cs->state & MY_CS_NONASCII)

to know if a character set is not ASCII compatible.

Note, SJIS is not ASCII compatible.
This change is probably not correct:
https://github.com/MariaDB/server/commit/54e4cc0b6571a6a929538eca46f940c465e42e68

Alexander Barkov added a comment - 2019-02-11 12:53 - edited I've taken a look into a number of commits: @kevgs pushed 6 commits. 20cb37e add GBK 053e48e add UJIS 4e3f6ef add SJIS 54e4cc0 add BIG5 90b4b4f add latin2 3237b5e added latin7 There is no need to list all ASCII compatible character set names and check them using strcmp(). Please check: (cs->state & MY_CS_NONASCII) to know if a character set is not ASCII compatible. Note, SJIS is not ASCII compatible. This change is probably not correct: https://github.com/MariaDB/server/commit/54e4cc0b6571a6a929538eca46f940c465e42e68

Marko Mäkelä added a comment - 2019-02-14 18:44

Due to conflicts with ~~MDEV-15563~~, I refactored the InnoDB part of this and also the functions Field_str::is_equal() and Field_varstring::is_equal().

The following bugs were filed for limitations that I think we should try to fix soon:
~~MDEV-18583~~ Avoid copying when changing the type of an indexed column
MDEV-18584 Avoid copying when altering CHAR column in InnoDB table

Marko Mäkelä added a comment - 2019-02-14 18:44 Due to conflicts with MDEV-15563 , I refactored the InnoDB part of this and also the functions Field_str::is_equal() and Field_varstring::is_equal() . The following bugs were filed for limitations that I think we should try to fix soon: MDEV-18583 Avoid copying when changing the type of an indexed column MDEV-18584 Avoid copying when altering CHAR column in InnoDB table

People

Assignee:: Eugene Kosov (Inactive)

Reporter:: Marko Mäkelä

Votes:: 3 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 2018-03-14 14:04

Updated:: 2022-05-10 12:32

Resolved:: 2019-02-14 18:44

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server