[MDEV-12161] Can't specify collation for virtual columns - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.2.4, 10.3.11, 10.2.19
Fix Version/s: 10.2.37, 10.3.28, 10.4.18, 10.5.9
Component/s: Virtual Columns
Labels:
- upstream-fixed

Description

You can specify the character set for a virtual column but not the collation. I.e. the following is not allowed:

CREATE TABLE t1(c1 CHAR(10), c2 CHAR(10) COLLATE utf8_bin AS(c1));

but the following is allowed:

CREATE TABLE t1(c1 CHAR(10), c2 CHAR(10) CHARACTER SET utf8 AS(c1));

The reason why one would want to do this would be to allowing several indexes on the same column using different collations. This is possible to achieve using a workaround, and once this is done index lookups and sorting works as expected. To create a table with UTF8 column and having both swedish and german collations:

CREATE TABLE t1(c1 CHAR(10) CHARACTER SET utf8 COLLATE utf8_swedish_ci,

  c2 CHAR(10) AS (c1),

  KEY(c1),

  KEY(c2)) CHARACTER SET utf8 COLLATE utf8_german2_ci;

Using this construct I can now select from the table and use an indexed sort using either german or swedish collations. But I can not explicitly set the collation of the virtual column here, I can only set the source column (c1) to a dofferent collation than the default for the table and let the table default affect the virtual column.

Attachments

Issue Links

duplicates

MDEV-23144 Combination of GENERATE ALWAYS and COLLATE does not work

Closed

is duplicated by

MDEV-20387 COLLATE option not accepted in generated column definition

Closed

links to

mysql Bug #76329

mysql fix - 3e6253fa8c70758a2da0c0bda7d989c582c1027d

Stack overflow

Activity

Ascending order - Click to sort in descending order

View 4 older comments

Gerrit Drost added a comment - 2020-08-04 06:49

We have a system in place that mirrors data from an external source. Since the source data is case sensitive we default the mirrored tables to utf8_bin. Our software can override the collation for specific columns, if necessary. We fix some oddities in the source schema by augmenting the mirrored tables using generated columns.

At the moment we are working on indexing a generated column. Creating the index works fine, but the problem is the index should be case insensitive. For normal columns this is no problem, we can tell our software to change the DDL and set the collation of one column to something else, and that will work fine. For the generated column however, this is not possible because of this bug.

We could change our software to switch everything around: have the tables default to a case insensitive collation, and then by default set the collation for all columns back to utf8_bin. This would work, but we find it kind of risky (lots of testing with lots of data) and costly for something that is essentially a workaround for a bug in MariaDB.

I'd love to help fix this bug, but I have no experience with C/C++ or with database server software. Is there any way I can help progress this bug?

Gerrit Drost added a comment - 2020-08-04 06:49 We have a system in place that mirrors data from an external source. Since the source data is case sensitive we default the mirrored tables to utf8_bin. Our software can override the collation for specific columns, if necessary. We fix some oddities in the source schema by augmenting the mirrored tables using generated columns. At the moment we are working on indexing a generated column. Creating the index works fine, but the problem is the index should be case insensitive. For normal columns this is no problem, we can tell our software to change the DDL and set the collation of one column to something else, and that will work fine. For the generated column however, this is not possible because of this bug. We could change our software to switch everything around: have the tables default to a case insensitive collation, and then by default set the collation for all columns back to utf8_bin. This would work, but we find it kind of risky (lots of testing with lots of data) and costly for something that is essentially a workaround for a bug in MariaDB. I'd love to help fix this bug, but I have no experience with C/C++ or with database server software. Is there any way I can help progress this bug?

Sergei Golubchik added a comment - 2020-09-08 08:44

gdrost, note that it possible to specify a collation for a generated column using a different syntax, see my first comment.

This bug, when fixed, will allow the first (COLLATE before AS) syntax to work, but it will not add any new functionality, it will only allow to do what is already possible but using a different order of keywords.

Sergei Golubchik added a comment - 2020-09-08 08:44 gdrost , note that it possible to specify a collation for a generated column using a different syntax, see my first comment. This bug, when fixed, will allow the first ( COLLATE before AS ) syntax to work, but it will not add any new functionality, it will only allow to do what is already possible but using a different order of keywords.

Michael van Eerd added a comment - 2020-09-16 14:57

@serg
I've tried testing the code snippet in your first comment but have been unable to get this to work, which version did you confirm this worked on at the time?
I've tried it on the latest docker containers for 10.2, 10.3, 10.4 and 10.5
c2 still has the table collation, not the one set in the query, despite using the exact SQL syntax you suggested

The workaround in the initial post does work, by setting the table collation to utf8_unicode_ci and collate all the non-virtual columns back to utf8_bin
But as described in the comment by @gdrost this would be quite cumbersome to test and implement in our current environment and would limit us to one collation for all the virtual columns.

If you could confirm whether the SQL in your first comment still works and on which version that would be useful
If it indeed no longer works and is a regression should we open a new bug for that, since this bug is about this specific syntax?

Any other suggestions on how we could make this work in a recent stable release of MariaDB would be much appreciated.
I've included a minimal snippet below to reproduce our use-case and what we tried so far

If there is anything else we can do to help progress this along I'll be glad to provide what I can

 -- This shows the problem, CI_CONCAT column should be utf8_unicode_ci but is still utf8_bin (table collation)

 CREATE TABLE IF NOT EXISTS `Collation_test`

	`BIN_COLUMN`                 VARCHAR(10), -- utf8_bin

    `CI_COLUMN1`                 VARCHAR(35) CHARACTER SET "utf8" COLLATE "utf8_unicode_ci", -- utf8_unicode_ci

    `CI_COLUMN2`                 VARCHAR(25) CHARACTER SET "utf8" COLLATE "utf8_unicode_ci", -- utf8_unicode_ci

    `CI_CONCAT`                  VARCHAR(60) AS (CONCAT(CI_COLUMN1, CI_COLUMN2)) VIRTUAL, -- utf8_bin, expected utf8_unicode_ci

    `CI_CONCAT2`                 VARCHAR(60) AS (CONCAT(CI_COLUMN1 COLLATE "utf8_unicode_ci", CI_COLUMN2 COLLATE "utf8_unicode_ci")), -- utf8_bin, expected utf8_unicode_ci

    `CI_CONCAT3`                 VARCHAR(60) AS (CONCAT(CI_COLUMN1, CI_COLUMN2) COLLATE "utf8_unicode_ci") -- utf8_bin, expected utf8_unicode_ci

    -- -- The syntax below throws a syntax error in MariaDB, see MDEV-12161

    -- `CI_CONCAT`                 VARCHAR(60) COLLATE "utf8_unicode_ci" AS (CONCAT(CI_COLUMN1, CI_COLUMN2)) VIRTUAL

) CHARACTER SET "utf8"

  COLLATE "utf8_bin"

  ENGINE = InnoDB;

Michael van Eerd added a comment - 2020-09-16 14:57 @serg I've tried testing the code snippet in your first comment but have been unable to get this to work, which version did you confirm this worked on at the time? I've tried it on the latest docker containers for 10.2, 10.3, 10.4 and 10.5 c2 still has the table collation, not the one set in the query, despite using the exact SQL syntax you suggested The workaround in the initial post does work, by setting the table collation to utf8_unicode_ci and collate all the non-virtual columns back to utf8_bin But as described in the comment by @gdrost this would be quite cumbersome to test and implement in our current environment and would limit us to one collation for all the virtual columns. If you could confirm whether the SQL in your first comment still works and on which version that would be useful If it indeed no longer works and is a regression should we open a new bug for that, since this bug is about this specific syntax? Any other suggestions on how we could make this work in a recent stable release of MariaDB would be much appreciated. I've included a minimal snippet below to reproduce our use-case and what we tried so far If there is anything else we can do to help progress this along I'll be glad to provide what I can -- This shows the problem, CI_CONCAT column should be utf8_unicode_ci but is still utf8_bin (table collation) CREATE TABLE IF NOT EXISTS `Collation_test` ( `BIN_COLUMN` VARCHAR(10), -- utf8_bin `CI_COLUMN1` VARCHAR(35) CHARACTER SET "utf8" COLLATE "utf8_unicode_ci", -- utf8_unicode_ci `CI_COLUMN2` VARCHAR(25) CHARACTER SET "utf8" COLLATE "utf8_unicode_ci", -- utf8_unicode_ci `CI_CONCAT` VARCHAR(60) AS (CONCAT(CI_COLUMN1, CI_COLUMN2)) VIRTUAL, -- utf8_bin, expected utf8_unicode_ci `CI_CONCAT2` VARCHAR(60) AS (CONCAT(CI_COLUMN1 COLLATE "utf8_unicode_ci", CI_COLUMN2 COLLATE "utf8_unicode_ci")), -- utf8_bin, expected utf8_unicode_ci `CI_CONCAT3` VARCHAR(60) AS (CONCAT(CI_COLUMN1, CI_COLUMN2) COLLATE "utf8_unicode_ci") -- utf8_bin, expected utf8_unicode_ci -- -- The syntax below throws a syntax error in MariaDB, see MDEV-12161 -- `CI_CONCAT` VARCHAR(60) COLLATE "utf8_unicode_ci" AS (CONCAT(CI_COLUMN1, CI_COLUMN2)) VIRTUAL ) CHARACTER SET "utf8" COLLATE "utf8_bin" ENGINE = InnoDB;

Ryan Leadenham added a comment - 2020-10-30 16:28

Just coming across this issue. For me I have a field with JSON data (LONGTEXT utf8mb4_bin), then I have a few STORED fields that reference data at certain paths in the JSON. I can add the COLLATE utfmb4_bin in the virtual definition, but the column still reports as the table default (utf8mb4_unicode_ci). I'm definitely no expert in this, but what it seems like would happen is when updating the JSON field, data would be eventually converted to utf8mb4_unicode_ci and stored that way (unless generated fields silently ignore the collation?). Then when reading, nothing would convert it back to utf8_bin. I would have to imagine this would cause some inconsistencies (if it doesn't then there wouldn't be a different between utf8mb4_bin and utf8mb4_unicode_ci).

Even if it would end up working as expected, it's very confusing both when reading the table structure and trying to update it. So even if this technically isn't a bug it would still be an extremely nice feature to have.

Ryan Leadenham added a comment - 2020-10-30 16:28 Just coming across this issue. For me I have a field with JSON data (LONGTEXT utf8mb4_bin), then I have a few STORED fields that reference data at certain paths in the JSON. I can add the COLLATE utfmb4_bin in the virtual definition, but the column still reports as the table default (utf8mb4_unicode_ci). I'm definitely no expert in this, but what it seems like would happen is when updating the JSON field, data would be eventually converted to utf8mb4_unicode_ci and stored that way (unless generated fields silently ignore the collation?). Then when reading, nothing would convert it back to utf8_bin. I would have to imagine this would cause some inconsistencies (if it doesn't then there wouldn't be a different between utf8mb4_bin and utf8mb4_unicode_ci). Even if it would end up working as expected, it's very confusing both when reading the table structure and trying to update it. So even if this technically isn't a bug it would still be an extremely nice feature to have.

Gerrit Drost added a comment - 2020-11-17 09:39 - edited

I think the example @MvEerd provided proves this is actually more than just a syntax bug, which Sergei stated on 2020-09-08.

It seems impossible right now to have a generated column with a collation different from the table collation. It would be nice if someone could confirm this is a bug, or give some pointers how to work around this issue.

Gerrit Drost added a comment - 2020-11-17 09:39 - edited I think the example @MvEerd provided proves this is actually more than just a syntax bug, which Sergei stated on 2020-09-08. It seems impossible right now to have a generated column with a collation different from the table collation. It would be nice if someone could confirm this is a bug, or give some pointers how to work around this issue.

MariaDB Server

Can't specify collation for virtual columns

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration