  1. MariaDB Server
  2. MDEV-38214

FTS sharding fails because Unicode collations map all tokens to auxiliary index 0


Details

    Description

      The idea behind the fulltext auxiliary tables is to distribute tokens evenly across them.

      const  fts_index_selector_t fts_index_selector[] = {
              { 9, "INDEX_1" },
              { 65, "INDEX_2" },
              { 70, "INDEX_3" },
              { 75, "INDEX_4" },
              { 80, "INDEX_5" },
              { 85, "INDEX_6" },
              {  0 , NULL    }
      };
      

      The buckets are:

      INDEX_1: tokens starting with a digit '0'–'9'
      INDEX_2: 'A'–'E'
      INDEX_3: 'F'–'J'
      INDEX_4: 'K'–'O'
      INDEX_5: 'P'–'T'
      INDEX_6: 'U'–'Z'
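
The letter ranges above follow from the selector thresholds being the ASCII codes of 'A', 'F', 'K', 'P' and 'U'. A minimal sketch that checks this, assuming an ASCII execution character set; the names `kLowerBound` and `letter_range` are hypothetical and do not exist in InnoDB:

```cpp
#include <cassert>
#include <string>

// Threshold values copied from fts_index_selector[] in the description.
// Array position 0 corresponds to INDEX_1, position 5 to INDEX_6.
constexpr int kLowerBound[] = {9, 65, 70, 75, 80, 85};
constexpr int kBuckets = 6;

// The five letter thresholds are the ASCII codes of 'A', 'F', 'K', 'P', 'U'.
static_assert(kLowerBound[1] == 'A' && kLowerBound[2] == 'F', "ASCII assumed");
static_assert(kLowerBound[3] == 'K' && kLowerBound[4] == 'P', "ASCII assumed");
static_assert(kLowerBound[5] == 'U', "ASCII assumed");

// Describe the letter range of the bucket at array position i (1..5),
// for example "A-E" for position 1 (INDEX_2).
std::string letter_range(int i) {
    assert(i >= 1 && i < kBuckets);
    const char lo = static_cast<char>(kLowerBound[i]);
    // The last bucket has no upper threshold; 'Z' is used here for display.
    const char hi = (i + 1 < kBuckets)
                        ? static_cast<char>(kLowerBound[i + 1] - 1)
                        : 'Z';
    return std::string(1, lo) + "-" + std::string(1, hi);
}
```

For instance, `letter_range(1)` describes INDEX_2 and `letter_range(5)` describes INDEX_6.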
      

      However, in version 11.8 the InnoDB default charset is utf8.
      fts_select_index_by_range():

              ulint                   value = innobase_strnxfrm(cs, str, len);

              while (fts_index_selector[selected].value != 0) {
                      if (fts_index_selector[selected].value == value) {
                              return(selected);
                      } else if (fts_index_selector[selected].value > value) {
                              return(selected > 0 ? selected - 1 : 0);
                      }

                      ++selected;
              }
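
To see how the loop behaves for concrete values, here is a standalone sketch of the same lookup that takes the sort-key value directly instead of calling innobase_strnxfrm(). The names `selector_t`, `selector` and `select_index` are hypothetical, and the final return for values above the last threshold is an assumption, since the excerpt does not show what follows the loop:

```cpp
#include <cstddef>

// Simplified stand-in for fts_index_selector_t (hypothetical name).
struct selector_t {
    unsigned long value;
    const char*   suffix;
};

// Threshold table copied from the description.
static const selector_t selector[] = {
    {9, "INDEX_1"},  {65, "INDEX_2"}, {70, "INDEX_3"},
    {75, "INDEX_4"}, {80, "INDEX_5"}, {85, "INDEX_6"},
    {0, nullptr}
};

// Same loop as in the excerpt, with the sort-key value passed in.
std::size_t select_index(unsigned long value) {
    std::size_t selected = 0;

    while (selector[selected].value != 0) {
        if (selector[selected].value == value) {
            return selected;
        } else if (selector[selected].value > value) {
            return selected > 0 ? selected - 1 : 0;
        }
        ++selected;
    }

    // Assumption: values above the last threshold use the last bucket.
    return selected - 1;
}
```

With this sketch, `select_index('C')` yields 1 (INDEX_2) and `select_index('V')` yields 5 (INDEX_6), while any value below 65, such as 33, yields 0.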
      

      Consider the following example:

      create table t1(f1 int, f2 char(100), fulltext(f2))engine=innodb;
      insert into t1 values(1, "check"), (2, "floor"), (3, "king"),
      (4, "pawn"), (5, "van"), (6, "123");
      

      In 10.11, the records are distributed across different auxiliary tables.

      From 11.8 onwards, however, all records are inserted into the 0th auxiliary table,
      because innobase_strnxfrm(cs, str, len) always returns 33 from 11.8 onwards.

      IIUC, strnxfrm() converts text into a collation-defined byte sequence (sort key) that preserves the collation’s ordering semantics. We read 1 byte from that byte sequence to decide
      the FTS index partition.

      For Unicode collations, many Latin letters have sort keys that begin with the same byte, so they collapse to the same prefix (0x21).
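
The effect on the six example tokens can be illustrated with a toy model. This is only a sketch under stated assumptions: `first_byte_simple` assumes a single-byte, case-insensitive collation whose first sort-key byte is the uppercase ASCII code, and `first_byte_toy_unicode` is a made-up stand-in for a Unicode collation in which only the leading byte 0x21 comes from the description (the low byte is hypothetical). Neither function calls the real strnxfrm() machinery, and all names are hypothetical.

```cpp
#include <array>
#include <cctype>
#include <string>
#include <vector>

// Bucket lookup equivalent to the selector thresholds in the description.
int bucket_of(unsigned value) {
    static const unsigned lower[] = {65, 70, 75, 80, 85};
    int bucket = 0;
    for (unsigned threshold : lower) {
        if (value >= threshold) ++bucket;
    }
    return bucket;
}

// Assumption: first sort-key byte is the uppercase ASCII code of the
// first character (single-byte, case-insensitive collation).
unsigned first_byte_simple(const std::string& token) {
    return static_cast<unsigned>(
        std::toupper(static_cast<unsigned char>(token[0])));
}

// Assumption: toy 16-bit weight whose high byte is always 0x21; the low
// byte is hypothetical and only makes the weights differ per character.
unsigned first_byte_toy_unicode(const std::string& token) {
    const unsigned weight = 0x2100u + static_cast<unsigned char>(token[0]);
    return weight >> 8;  // only the leading byte selects the bucket
}

// Count how many of the example tokens land in each of the six buckets.
std::array<int, 6> histogram(unsigned (*first_byte)(const std::string&)) {
    const std::vector<std::string> tokens = {
        "check", "floor", "king", "pawn", "van", "123"};
    std::array<int, 6> counts{};
    for (const std::string& token : tokens) {
        ++counts[bucket_of(first_byte(token))];
    }
    return counts;
}
```

Under these assumptions, `histogram(first_byte_simple)` places one token in each bucket, while `histogram(first_byte_toy_unicode)` places all six tokens in bucket 0, because the bucket is chosen from a byte that no longer varies between letters.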

      The same problem occurs in lower versions if the table uses a utf8 Unicode collation,
      as in the following:

      create table t1(f1 int, f2 char(100), fulltext(f2))CHARSET utf8mb4 COLLATE utf8mb4_unicode_ci engine=innodb;
      insert into t1 values(1, "check"), (2, "floor"), (3, "king"),
                           (4, "pawn"), (5, "van"), (6, "123");
      

      The tokens of the above t1 table also end up in the 0th auxiliary table.

            People

              thiru Thirunarayanan Balathandayuthapani