Details
Bug
Status: Open
Major
Resolution: Unresolved
10.6, 10.11, 11.4, 12.2, 11.8
None
Description
The idea behind the fulltext auxiliary tables is to distribute tokens evenly across them.
    const fts_index_selector_t fts_index_selector[] = {
        { 9, "INDEX_1" },
        { 65, "INDEX_2" },
        { 70, "INDEX_3" },
        { 75, "INDEX_4" },
        { 80, "INDEX_5" },
        { 85, "INDEX_6" },
        { 0 , NULL }
    };
The buckets are:
    INDEX_1: tokens starting with a digit '0'–'9'
    INDEX_2: 'A'–'E'
    INDEX_3: 'F'–'J'
    INDEX_4: 'K'–'O'
    INDEX_5: 'P'–'T'
    INDEX_6: 'U'–'Z'
However, in 11.8 the default character set is utf8mb4 rather than latin1 (see MDEV-19123).
fts_select_index_by_range():
    ulint value = innobase_strnxfrm(cs, str, len);

    while (fts_index_selector[selected].value != 0) {

        if (fts_index_selector[selected].value == value) {

            return(selected);

        } else if (fts_index_selector[selected].value > value) {

            return(selected > 0 ? selected - 1 : 0);
        }

        ++selected;
    }
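The selection logic above can be tried in isolation with a minimal sketch. It assumes a simplified stand-in for fts_index_selector_t (the field name suffix is a guess), takes the already-computed sort-key value as its argument instead of calling innobase_strnxfrm(), and uses the hypothetical name select_index_by_value. The return after the loop is also an assumption: values beyond the last threshold are routed to the last table.

```c
#include <stddef.h>

/* Simplified stand-in for InnoDB's fts_index_selector_t (assumption). */
typedef struct {
    unsigned long value;   /* lowest sort-key value routed to this table */
    const char   *suffix;  /* auxiliary table suffix */
} fts_index_selector_t;

static const fts_index_selector_t fts_index_selector[] = {
    { 9,  "INDEX_1" },
    { 65, "INDEX_2" },
    { 70, "INDEX_3" },
    { 75, "INDEX_4" },
    { 80, "INDEX_5" },
    { 85, "INDEX_6" },
    { 0,  NULL }
};

/* Hypothetical helper: same loop as in the excerpt, but the caller passes
   the value that innobase_strnxfrm() would have produced. */
static unsigned long select_index_by_value(unsigned long value)
{
    unsigned long selected = 0;

    while (fts_index_selector[selected].value != 0) {
        if (fts_index_selector[selected].value == value) {
            return selected;
        } else if (fts_index_selector[selected].value > value) {
            return selected > 0 ? selected - 1 : 0;
        }
        ++selected;
    }

    /* Assumption: anything past the last threshold lands in the last table. */
    return selected - 1;
}
```

With a single-byte collation whose leading sort-key byte is the uppercase ASCII code of the first letter, select_index_by_value('C') gives 1 (INDEX_2) and select_index_by_value('V') gives 5 (INDEX_6). A value of 33 falls below the second threshold (65) and gives 0 (INDEX_1).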
Consider the following example:
    create table t1(f1 int, f2 char(100), fulltext(f2))engine=innodb;
    insert into t1 values(1, "check"), (2, "floor"), (3, "king"),
                         (4, "pawn"), (5, "van"), (6, "123");
In 10.11, each of these tokens is inserted into a different auxiliary table.
From 11.8 onwards, all of them are inserted into the 0th auxiliary table.
innobase_strnxfrm(cs, str, len) always returns 33 (0x21) from 11.8 onwards.
IIUC, strnxfrm() converts text into a collation-defined byte sequence (sort key) that preserves the collation's ordering semantics. We read 1 byte from that byte sequence to decide the FTS index partition.
For Unicode collations, many Latin letters share the same leading weight and collapse to the same prefix (0x21).
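The collapse can be illustrated with a small sketch. The sort keys below are hypothetical: only the shared 0x21 prefix comes from this report, and the trailing bytes are made up for illustration. The single-byte keys assume a collation whose weights are the uppercase ASCII codes. The helper name first_sort_key_byte is likewise an assumption, not an InnoDB function.

```c
#include <stddef.h>

/* Hypothetical helper: returns the leading byte of a sort key,
   or 0 for a missing or empty key. */
static unsigned long first_sort_key_byte(const unsigned char *key, size_t len)
{
    return (key != NULL && len > 0) ? key[0] : 0;
}

/* Assumed single-byte collation: weights are uppercase ASCII codes. */
static const unsigned char key_check_single[] = { 0x43, 0x48 }; /* "CH" */
static const unsigned char key_floor_single[] = { 0x46, 0x4C }; /* "FL" */

/* Assumed Unicode collation: multi-byte weights sharing the 0x21 prefix.
   The second byte of each key is invented for illustration. */
static const unsigned char key_check_unicode[] = { 0x21, 0x7A };
static const unsigned char key_floor_unicode[] = { 0x21, 0xC4 };
```

Under these assumptions the single-byte keys yield 67 and 70, which select different partitions, while both Unicode keys yield 33 and therefore select the same partition.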
The same problem occurs in earlier versions when the table uses a utf8 (Unicode) collation, for example:
    create table t1(f1 int, f2 char(100), fulltext(f2))CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci engine=innodb;
    insert into t1 values(1, "check"), (2, "floor"), (3, "king"),
                         (4, "pawn"), (5, "van"), (6, "123");
The tokens of this t1 table also all end up in the 0th auxiliary table.
Issue Links
relates to
- MDEV-19123 Change default charset from latin1 to utf8mb4 (Closed)
- MDEV-25848 Support for Multi-Valued Indexes (Open)