[MDEV-38214] FTS Sharding fails due to unicode collations map all tokens to aux index 0 - Jira

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.6, 10.11, 11.4, 11.8, 12.2
Fix Version/s: 11.8, 12.2
Component/s: Full-text Search, Storage Engine - InnoDB
Labels: None

Description

The idea of the fulltext auxiliary tables is to distribute the tokens evenly across them.

const  fts_index_selector_t fts_index_selector[] = {
        { 9, "INDEX_1" },
        { 65, "INDEX_2" },
        { 70, "INDEX_3" },
        { 75, "INDEX_4" },
        { 80, "INDEX_5" },
        { 85, "INDEX_6" },
        {  0 , NULL    }
};

The buckets are:

INDEX_1: tokens starting with a digit '0'–'9'

INDEX_2: 'A'–'E'

INDEX_3: 'F'–'J'

INDEX_4: 'K'–'O'

INDEX_5: 'P'–'T'

INDEX_6: 'U'–'Z'

But in 11.8, the default charset is utf8.

fts_select_index_by_range():

        ulint                   value = innobase_strnxfrm(cs, str, len);

        while (fts_index_selector[selected].value != 0) {
                if (fts_index_selector[selected].value == value) {
                        return(selected);
                } else if (fts_index_selector[selected].value > value) {
                        return(selected > 0 ? selected - 1 : 0);
                }
                ++selected;
        }
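
To make the threshold lookup easier to follow, here is a minimal standalone sketch of the same loop. It takes the value directly instead of calling innobase_strnxfrm(). The names Selector, kSelector and select_index are hypothetical, and the final return for values above the last threshold is an assumption, since the excerpt above does not show what happens after the loop.

```cpp
#include <cstddef>

// Thresholds copied from the fts_index_selector[] table in this report.
struct Selector {
    unsigned value;
    const char* suffix;
};

static const Selector kSelector[] = {
    {9, "INDEX_1"},  {65, "INDEX_2"}, {70, "INDEX_3"},
    {75, "INDEX_4"}, {80, "INDEX_5"}, {85, "INDEX_6"},
    {0, nullptr}  // sentinel terminating the table
};

// Mirrors the loop quoted from fts_select_index_by_range().
// `value` stands for what innobase_strnxfrm() returned for the token.
std::size_t select_index(unsigned value) {
    std::size_t selected = 0;
    while (kSelector[selected].value != 0) {
        if (kSelector[selected].value == value) {
            return selected;
        } else if (kSelector[selected].value > value) {
            // First threshold above the value: use the previous bucket.
            return selected > 0 ? selected - 1 : 0;
        }
        ++selected;
    }
    // Assumption: values above the last threshold go to the last bucket.
    return selected - 1;
}
```

With these thresholds, select_index('C') gives 1 (INDEX_2) and select_index('F') gives 2 (INDEX_3), while any value below 65, including 33, gives 0 (INDEX_1).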

Consider the following example:

create table t1(f1 int, f2 char(100), fulltext(f2))engine=innodb;
insert into t1 values(1, "check"), (2, "floor"), (3, "king"),
                     (4, "pawn"), (5, "van"), (6, "123");

In 10.11, the records are distributed across different auxiliary tables.

But from 11.8 onwards, all records are inserted into the 0th auxiliary table, because innobase_strnxfrm(cs, str, len) always returns 33.

IIUC, strnxfrm() converts text into a collation-defined byte sequence (sort key) that preserves the collation's ordering semantics. We read 1 byte from that byte sequence to decide the FTS index partition.

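For Unicode collations, many Latin letters share the same weight and collapse to the same prefix (0x21, i.e. 33). Since 33 is below the INDEX_2 threshold of 65, every such token is assigned to auxiliary index 0.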
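
The effect on the distribution can be illustrated with a small sketch that counts tokens per bucket under two stand-in key functions. Both key functions are hypothetical simplifications, not the server's collation code: ascii_first_byte pretends the first sort-key byte is the uppercase ASCII code of the first character, and unicode_first_byte pretends every key starts with 0x21, as described above. bucket_for and histogram are likewise names made up for this example.

```cpp
#include <array>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Thresholds from fts_index_selector[] in this report (sentinel omitted).
static const unsigned kThresholds[] = {9, 65, 70, 75, 80, 85};
constexpr std::size_t kBuckets = 6;

// Index of the last threshold that is <= value, or 0 if there is none.
std::size_t bucket_for(unsigned value) {
    std::size_t bucket = 0;
    for (std::size_t i = 0; i < kBuckets; ++i) {
        if (kThresholds[i] <= value) {
            bucket = i;
        }
    }
    return bucket;
}

// Hypothetical single-byte collation: key starts with the uppercase ASCII code.
unsigned ascii_first_byte(const std::string& token) {
    if (token.empty()) return 0;
    return static_cast<unsigned>(
        std::toupper(static_cast<unsigned char>(token[0])));
}

// Hypothetical Unicode collation: every key starts with 0x21 (assumed).
unsigned unicode_first_byte(const std::string& token) {
    return token.empty() ? 0u : 0x21u;
}

// Count how many tokens land in each of the six buckets.
template <typename KeyFn>
std::array<int, kBuckets> histogram(const std::vector<std::string>& tokens,
                                    KeyFn key) {
    std::array<int, kBuckets> counts{};
    for (const auto& token : tokens) {
        ++counts[bucket_for(key(token))];
    }
    return counts;
}
```

For the six tokens used in this report ("check", "floor", "king", "pawn", "van", "123"), histogram(tokens, ascii_first_byte) puts one token in each bucket, while histogram(tokens, unicode_first_byte) puts all six in bucket 0.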

The same problem occurs in lower versions if the table is created with a utf8 collation, for example:

create table t1(f1 int, f2 char(100), fulltext(f2))CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci engine=innodb;
insert into t1 values(1, "check"), (2, "floor"), (3, "king"),
                     (4, "pawn"), (5, "van"), (6, "123");

The tokens of the above table t1 also end up in the 0th auxiliary table.

Issue Links

relates to

MDEV-19123 Change default charset from latin1 to utf8mb4 (Closed)

MDEV-25848 Support for Multi-Valued Indexes (Open)

People

Assignee: Thirunarayanan Balathandayuthapani

Reporter: Thirunarayanan Balathandayuthapani

Votes: 0

Watchers: 3

Dates

Created: 2025-11-27 08:05

Updated: 2025-12-01 10:52
