Details

    Description

      This is something added in MySQL 5.7, so perhaps it should be included in MariaDB.

      "ngram and MeCab full-text parser plugins. As of MySQL 5.7.6, MySQL provides a built-in full-text ngram parser plugin that supports Chinese, Japanese, and Korean (CJK). For more information, see Section 13.9.8, “ngram Full-Text Parser”."
      http://dev.mysql.com/doc/refman/5.7/en/mysql-nutshell.html

      And the link referenced above is:
      http://dev.mysql.com/doc/refman/5.7/en/fulltext-search-ngram.html
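
      For reference, this is what the requested MySQL 5.7 feature looks like in use, per the manual page above (a sketch; the table and column names are made up for illustration):

      ```sql
      -- FULLTEXT index using the built-in ngram parser, which tokenizes text
      -- into contiguous sequences of n characters (n = ngram_token_size,
      -- default 2), so CJK text needs no word delimiters to be indexed.
      CREATE TABLE articles (
          id INT AUTO_INCREMENT PRIMARY KEY,
          body TEXT,
          FULLTEXT INDEX ft_body (body) WITH PARSER ngram
      ) ENGINE=InnoDB;

      -- Queried like any other FULLTEXT index:
      SELECT id FROM articles
      WHERE MATCH (body) AGAINST ('数据库' IN BOOLEAN MODE);
      ```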

      The MeCab task previously described here is now tracked separately as MDEV-22987

      Attachments

        Issue Links

          Activity

            netbroom Dan Sherry added a comment -

            For anyone who is eagerly awaiting this feature and ended up here like I did: you can get this functionality today with the Mroonga storage engine, and you can use Mroonga in "wrapper mode" on top of InnoDB to preserve functionality like transactions: https://mroonga.org/docs/tutorial/wrapper.html

            The ngram parser is extremely useful and removes the need to add a search engine like Sphinx to my stack. A typical word-based FULLTEXT search just is not cutting it for my data, which contains symbols, URLs, IP address strings, etc.

            You can create a FULLTEXT index that behaves like an ngram parser on Mroonga tables: ALTER TABLE table ADD FULLTEXT (value) COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

            And then to query you use a typical FULLTEXT search in boolean mode: MATCH (value) AGAINST (? IN BOOLEAN MODE)
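
            Putting the two statements above together, a minimal sketch (the table name `logs`, column `value`, and search term are placeholders, not from the original comment):

            ```sql
            -- Mroonga in wrapper mode over InnoDB
            CREATE TABLE logs (
                id INT AUTO_INCREMENT PRIMARY KEY,
                value TEXT
            ) ENGINE=Mroonga COMMENT='engine "InnoDB"';

            -- Bigram tokenizer that also splits runs of symbols, letters,
            -- and digits, so URLs and IP strings become searchable
            ALTER TABLE logs ADD FULLTEXT (value)
                COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

            -- Standard boolean-mode FULLTEXT search
            SELECT id FROM logs
            WHERE MATCH (value) AGAINST ('192.168' IN BOOLEAN MODE);
            ```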

            Altering an existing table to a new storage engine takes far too long for me on tables with over a million rows, so to speed it up you can do:
            1. set session alter_algorithm='INPLACE';
            2. set foreign_key_checks=0;
            3. create table2 (without indexes) with Mroonga (ENGINE=Mroonga COMMENT='engine "InnoDB"')
            4. insert table1 data into table2
            5. add indexes, including the new FULLTEXT
            6. drop all foreign keys that reference your table
            7. rename table table1 to table1_old;
            8. rename table table2 to table1;
            9. re-add foreign keys
            10. set foreign_key_checks=1;
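
            The steps above can be sketched as SQL (column names, the `child_table`, and the foreign-key definitions are placeholder assumptions; adapt them to your schema):

            ```sql
            SET SESSION alter_algorithm='INPLACE';
            SET foreign_key_checks=0;

            -- 3. new table, same columns, no secondary indexes yet
            CREATE TABLE table2 (
                id INT AUTO_INCREMENT PRIMARY KEY,
                value TEXT
            ) ENGINE=Mroonga COMMENT='engine "InnoDB"';

            -- 4. copy the data
            INSERT INTO table2 SELECT * FROM table1;

            -- 5. add indexes, including the new FULLTEXT
            ALTER TABLE table2 ADD FULLTEXT (value)
                COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

            -- 6. drop foreign keys referencing table1 (one per referencing table)
            ALTER TABLE child_table DROP FOREIGN KEY fk_child_table1;

            -- 7.-8. swap the tables
            RENAME TABLE table1 TO table1_old, table2 TO table1;

            -- 9. re-add the foreign keys
            ALTER TABLE child_table
                ADD CONSTRAINT fk_child_table1
                FOREIGN KEY (table1_id) REFERENCES table1 (id);

            -- 10. re-enable checks
            SET foreign_key_checks=1;
            ```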

            Hope this helps for anyone who ended up here.

            lathander Jozef Kováč added a comment -

            Any news when or if ngram parser will be implemented?

            AnnaTrees Anna Trees added a comment -

            As more and more hosting providers switch from MySQL to MariaDB, this function is becoming increasingly important. I can't understand why this hasn't been implemented for years. MySQL is also open source; the code could be ported from there, even if it doesn't work for all character sets. At least the applications that have worked well with MySQL so far could continue to run. I am sure I speak for many developers who have been more or less involuntarily confronted with the switch from MySQL to MariaDB.

            netbroom Dan Sherry added a comment -

            Is there any sort of timeline on this? It's very important for searching within data that doesn't align with traditional FULLTEXT, like URLs and strings containing symbols. Ngram is pretty much the only performant way to do this with larger datasets (millions of rows).

            For example this is what GitHub uses with Code Search: https://github.blog/engineering/the-technology-behind-githubs-new-code-search/

            You can use Mroonga but it's not very stable, has trouble with larger tables, can corrupt tables if it crashes, and support and documentation are spotty, so a native ngram parser would be best.

            I noticed a commit from 5 years ago that is awaiting review. Any sort of update on progress for this feature would be greatly appreciated.

            svoj Sergey Vojtovich added a comment -

            serg, why was the contribution label removed from this issue? The pull request is still there and is marked as an "External Contribution".


            People

              serg Sergei Golubchik
              ccalender Chris Calender (Inactive)
              Votes: 19
              Watchers: 29
