Details

    Description

      This is something added in MySQL 5.7, so perhaps it should be included in MariaDB.

      "ngram and MeCab full-text parser plugins. As of MySQL 5.7.6, MySQL provides a built-in full-text ngram parser plugin that supports Chinese, Japanese, and Korean (CJK).

      For more information, see Section 13.9.8, “ngram Full-Text Parser”
      http://dev.mysql.com/doc/refman/5.7/en/mysql-nutshell.html

      And the link referenced above is:
      http://dev.mysql.com/doc/refman/5.7/en/fulltext-search-ngram.html
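
      For reference, this is roughly how the feature is used on the MySQL side (a minimal sketch against MySQL 5.7+; the table and column names are made up for illustration):

        -- ngram_token_size is a read-only startup option (default 2, range 1-10),
        -- so it has to be set in my.cnf or on the command line, e.g. --ngram_token_size=2
        CREATE TABLE articles (
          id INT AUTO_INCREMENT PRIMARY KEY,
          body TEXT,
          FULLTEXT INDEX ft_body (body) WITH PARSER ngram
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

        SELECT id FROM articles
        WHERE MATCH (body) AGAINST ('数据库' IN BOOLEAN MODE);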

      The MeCab task previously described here is now tracked separately as MDEV-22987.

          Activity

            marko Marko Mäkelä added a comment -

            greenman, thank you for pointing out the existing design problem with the various innodb_ft_ global variables. It could actually be acceptable to continue on the same path, to avoid the extra development at this stage. I think that we should try to allow the ngram token size to be changed freely between 2 and 10. Preferably it should be done without any server restart (maybe with a plugin uninstall/install if necessary), so that we can reasonably write a test that changes the parameter between writes and reads. (In my opinion, it is unreasonable to write a regression test that restarts the entire server many times, because a single restart can easily take 1 or 2 seconds.) We must ensure that such index corruption will not lead to any crashes.

            I really hope that we can replace the InnoDB FULLTEXT INDEX implementation in some not too distant major release. https://mariadb.com/resources/blog/initial-impressions-of-innodb-fulltext/ pointed out several design issues that I do not think can be addressed easily.
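
            To illustrate the design problem mentioned above (a sketch; the innodb_ft_ variable names are the existing ones, the session shown is hypothetical):

              -- The existing token-size options are server-wide, shared by every
              -- InnoDB FULLTEXT index, and read-only at runtime:
              SHOW GLOBAL VARIABLES LIKE 'innodb_ft_%token_size';
              SET GLOBAL innodb_ft_min_token_size = 2;
              -- ERROR 1238 (HY000): Variable 'innodb_ft_min_token_size' is a read only variable
              -- so any change requires a server restart plus a rebuild of each affected index.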

            netbroom Dan Sherry added a comment -

            For anyone who is eagerly awaiting this feature and ended up here like me: you can get this functionality today with the Mroonga storage engine, and you can use Mroonga in "wrapper mode" with InnoDB to preserve functionality like transactions: https://mroonga.org/docs/tutorial/wrapper.html

            The ngram parser is extremely useful and removes the need to add a search engine like Sphinx to my stack. A typical word-based FULLTEXT search just is not cutting it for the data I have, which contains symbols, URLs, IP address strings, etc.

            You can use this to create a FULLTEXT index that behaves like an ngram parser on Mroonga tables (t1 and value are placeholder names): ALTER TABLE t1 ADD FULLTEXT (value) COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

            And then to query you use a typical FULLTEXT search in boolean mode: MATCH (value) AGAINST (? IN BOOLEAN MODE)
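
            Putting those two together (a minimal end-to-end sketch, assuming the Mroonga plugin is installed; the table and column names are illustrative):

              -- wrapper mode: Mroonga provides the FULLTEXT index, InnoDB stores the rows
              CREATE TABLE notes (
                id INT AUTO_INCREMENT PRIMARY KEY,
                value TEXT NOT NULL
              ) ENGINE=Mroonga COMMENT='engine "InnoDB"' DEFAULT CHARSET=utf8mb4;

              ALTER TABLE notes ADD FULLTEXT (value)
                COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

              -- bigram tokenization lets boolean-mode matches work on symbols, URLs, IPs
              SELECT id FROM notes
              WHERE MATCH (value) AGAINST ('10.0.0' IN BOOLEAN MODE);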

            Converting existing tables to a new storage engine takes far too long for me on tables with over a million rows, so to speed it up you can do the following (a SQL sketch of these steps follows the list):
            1. set session alter_algorithm='INPLACE';
            2. set foreign_key_checks=0;
            3. create table2 (without indexes) with Mroonga (ENGINE=Mroonga COMMENT='engine "InnoDB"')
            4. insert table1's data into table2
            5. add indexes, including the new FULLTEXT
            6. drop all foreign keys that reference your table
            7. rename table table1 to table1_old;
            8. rename table table2 to table1;
            9. re-add the foreign keys
            10. set foreign_key_checks=1;
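
            Written out as SQL, the steps above might look like this (a sketch; table, column, and constraint names are placeholders):

              SET SESSION alter_algorithm='INPLACE';
              SET foreign_key_checks=0;

              -- 3. new table in wrapper mode, no secondary indexes yet
              CREATE TABLE table2 (
                id INT AUTO_INCREMENT PRIMARY KEY,
                value TEXT NOT NULL
              ) ENGINE=Mroonga COMMENT='engine "InnoDB"';

              -- 4. copy the data
              INSERT INTO table2 SELECT * FROM table1;

              -- 5. add indexes, including the new FULLTEXT
              ALTER TABLE table2 ADD FULLTEXT (value)
                COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

              -- 6. for each foreign key that references table1:
              ALTER TABLE child_table DROP FOREIGN KEY fk_name;

              -- 7./8. swap the tables
              RENAME TABLE table1 TO table1_old, table2 TO table1;

              -- 9. re-add the foreign keys against the new table1
              ALTER TABLE child_table ADD CONSTRAINT fk_name
                FOREIGN KEY (table1_id) REFERENCES table1 (id);

              SET foreign_key_checks=1;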

            Hope this helps anyone who ended up here.

            lathander Jozef Kováč added a comment -

            Any news on when or if the ngram parser will be implemented?

            AnnaTrees Anna Trees added a comment -

            As more and more hosting providers switch from MySQL to MariaDB, this feature is becoming increasingly important. I can't understand why this hasn't been implemented for years. MySQL is also open source, so the code could be ported from there, even if it doesn't work for all character sets. At least the applications that have worked well with MySQL so far could continue to run. I am sure I speak for many developers who have now been more or less involuntarily confronted with the switch from MySQL to MariaDB.

            netbroom Dan Sherry added a comment -

            Is there any sort of timeline on this? It's very important for searching within data that doesn't align with traditional FULLTEXT, like URLs and strings containing symbols. Ngram is pretty much the only performant way to do this with larger datasets (millions of rows).

            For example, this is what GitHub uses for its Code Search: https://github.blog/engineering/the-technology-behind-githubs-new-code-search/

            You can use Mroonga, but it's not very stable, has trouble with larger tables, can corrupt tables if it crashes, and its support and documentation are spotty, so a native ngram parser would be best.

            I noticed a commit from 5 years ago that is awaiting review. Any sort of update on progress for this feature would be greatly appreciated.


            People

              serg Sergei Golubchik
              ccalender Chris Calender (Inactive)
              Votes: 19
              Watchers: 28
