[MDEV-10267] Add "ngram" support to MariaDB Created: 2016-06-21 Updated: 2024-02-01 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Full-text Search |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major |
| Reporter: | Chris Calender (Inactive) | Assignee: | Vicențiu Ciorbaru |
| Resolution: | Unresolved | Votes: | 17 |
| Labels: | Compatibility, compat80 | ||
| Issue Links: |
|
| Description |
|
This is something added in MySQL 5.7, so perhaps it should be included in MariaDB. From the MySQL documentation: "ngram and MeCab full-text parser plugins. As of MySQL 5.7.6, MySQL provides a built-in full-text ngram parser plugin that supports Chinese, Japanese, and Korean (CJK). For more information, see Section 13.9.8, “ngram Full-Text Parser”." The MeCab task previously described here is now tracked separately at MDEV-22987. |
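For context, an n-gram parser splits text into overlapping sequences of n characters rather than whitespace-delimited words, which is what makes it usable for CJK text that has no word separators. A minimal sketch of the idea in Python (my own illustration, not the actual plugin code; the default n=2 mirrors MySQL's ngram_token_size default):

```python
def ngram_tokenize(text, n=2):
    """Split text into overlapping character n-grams (bigrams by default).

    Each whitespace-separated run is tokenized independently, so
    n-grams never span a word boundary.
    """
    tokens = []
    for run in text.split():
        tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens

print(ngram_tokenize("abc def"))  # ['ab', 'bc', 'de', 'ef']
print(ngram_tokenize("数据库"))    # ['数据', '据库']
```

A query term is tokenized the same way, so matching reduces to comparing n-gram tokens instead of whole words.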
| Comments |
| Comment by Zhenghao He [ 2017-06-22 ] |
|
It's been a year since this issue was reported, and nothing has been done? China has 300 million Internet users, and this n-gram full-text indexing functionality is crucial. Because of this issue I have to switch back to MySQL. Somebody please do something. |
| Comment by Gavin Chen [ 2018-07-07 ] |
|
Still not added? |
| Comment by Andrew Z [ 2018-07-07 ] |
|
No, I tried a week ago. |
| Comment by Jeff Xue [ 2018-09-10 ] |
|
Today I used ngram in MariaDB 10.2 and found this issue. I always thought that MariaDB was more versatile than MySQL; I wonder why this plugin hasn't been available. Thanks! |
| Comment by Awais Fiaz [ 2018-11-28 ] |
|
Nothing about it so far; I just checked today. |
| Comment by jccsxx [ 2019-09-13 ] |
|
It's strongly recommended to add this feature, ngram. I need it very much; otherwise I can only switch back to MySQL. |
| Comment by jccsxx [ 2019-12-06 ] |
|
ngram is really necessary. |
| Comment by LeeChihoon [ 2020-03-09 ] |
|
When will it be solved? |
| Comment by Ian Gilfillan [ 2020-06-23 ] |
|
Implementing ngram will be getting some attention soon. MeCab is now separated from this issue, so if MeCab fulltext searching is of interest, rather than just ngram, please vote on that issue. |
| Comment by Rinat Ibragimov (Inactive) [ 2020-07-22 ] |
|
There is also one feature of the MySQL version of the ngram plugin that is not present in the current version of the patchset in development. The meaning of stopwords is expanded so that all substrings of an n-gram are tested against the stopword list. So if the stopword list contains the word "a", the string "branch", which is normally parsed into the list of 3-grams ("bra", "ran", "anc", "nch"), will instead be parsed into the list ("nch"), since the other three 3-grams contain "a". Is this feature needed? Can it be useful? The patch from MySQL hardcoded a plugin-name check into the InnoDB code, so that its behavior changes when the "ngram" plugin is used for text parsing. They did that mostly because the InnoDB code checks the lengths of tokens coming from plugins, despite the plugin API explicitly stating that it is the plugin's responsibility to filter out words that are too short or too long. I think I've managed to avoid such explicit name hardcoding by removing the length checks for the case when plugins are used for parsing. However, that made it difficult to replicate the stopword behavior from the original patch, as this behavior is only needed for the "ngram" plugin and not for any other fulltext parsers. So I'd like to ask whether this feature is needed at all. I can't figure out how it may be used other than for filtering out some unwanted sequences of characters. Again, why? Censoring of some sort, perhaps? |
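To make the expanded stopword semantics concrete, here is a small Python sketch (my own illustration, not the patch under discussion) of the behavior described above: a 3-gram is dropped whenever it contains a stopword as a substring, which is what "all substrings of an ngram are tested against the stopword list" amounts to.

```python
def ngrams(text, n=3):
    """All contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngrams_without_stopwords(text, stopwords, n=3):
    """Drop any n-gram containing a stopword as a substring,
    mirroring the expanded stopword handling described above."""
    return [g for g in ngrams(text, n)
            if not any(sw in g for sw in stopwords)]

print(ngrams("branch"))                           # ['bra', 'ran', 'anc', 'nch']
print(ngrams_without_stopwords("branch", {"a"}))  # ['nch']
```

With an ordinary word parser, a one-character stopword like "a" only ever suppresses the token "a" itself; under this expanded rule it suppresses every n-gram that contains it, which is why the usefulness of the feature is being questioned.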
| Comment by Marko Mäkelä [ 2020-07-24 ] |
|
Thank you! I think that there are some problems with the MySQL 5.7 code that you have been adapting. It seems to me that the currently submitted n-gram parser would not work with UCS2 or UTF-16, even though I would expect those encodings to be used with Chinese, Japanese or Korean, where I would expect the n-gram search to be most useful. It also seems to me that the n-gram size should actually be part of the FULLTEXT INDEX metadata and not a global parameter. Changing that parameter could essentially corrupt all affected indexes or lead to wrong search results. The HA_IOPTION interface for index parameters would seem to be appropriate here. I think that we should also review and test compatibility with tables that were created before the introduction of this feature. Even though MariaDB Server 10.5.4 was already released as GA, I think that this enhancement could be included in the 10.5 series. The compatibility needs to be ensured even if we would postpone this to a later major release. |
| Comment by Ian Gilfillan [ 2020-07-30 ] |
|
marko, it seems like making this part of the fulltext index metadata would be quite a lot of extra development. It looks like there are three ways forward: 1) Limit ngram size to 2 only; if needed, there can be other parsers (3-gram, 4-gram, etc.). 2) Allow ngram token size from 2-10. The corruption you mention is not real corruption, I believe, but simply that the search results would not match. Already, when various fulltext variables are changed (innodb_ft_min_token_size, innodb_ft_max_token_size, innodb_ft_server_stopword_table, innodb_ft_user_stopword_table or innodb_ft_enable_stopword), the indexes need to be rebuilt, and I believe in the case of innodb_ft_min_token_size and innodb_ft_max_token_size. 3) Build token size into the fulltext index metadata: the most flexibility, but extensive work to implement. Given the estimates to complete 3), are 1) or 2) viable? |
| Comment by Marko Mäkelä [ 2020-07-31 ] |
|
greenman, thank you for pointing out the existing design problem with the various innodb_ft_ global variables. It could actually be acceptable to continue on the same path, to avoid the extra development at this stage. I think that we should try to allow the ngram token size to be changed freely between 2 and 10. Preferably it should be done without any server restart (maybe with plugin uninstall/install if necessary) so that we can reasonably write a test that changes the parameter between writes and reads. (In my opinion, it is unreasonable to write a regression test that restarts the entire server many times, because a single restart can easily take 1 or 2 seconds.) We must ensure that such index corruption will not lead to any crashes. I really hope that we can replace the InnoDB FULLTEXT INDEX implementation in some not-too-distant major release. https://mariadb.com/resources/blog/initial-impressions-of-innodb-fulltext/ pointed out several design issues that I do not think can be addressed easily. |
| Comment by Dan Sherry [ 2021-02-06 ] |
|
For anyone who is eagerly awaiting this feature and ended up here like myself: you can currently get this functionality with the Mroonga storage engine, and you can use Mroonga in "wrapper mode" with InnoDB to preserve functionality like transactions: https://mroonga.org/docs/tutorial/wrapper.html An ngram parser is extremely useful and removes the need to add a search engine like Sphinx to my stack. A typical word-based FULLTEXT search just is not cutting it for the data I have, which contains symbols, URLs, IP address strings, etc. You can use this to create a FULLTEXT index that behaves like an ngram parser on Mroonga tables: ALTER TABLE table ADD FULLTEXT (value) COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"'; And then to query, you use a typical FULLTEXT search in boolean mode: MATCH (value) AGAINST (? IN BOOLEAN MODE) Altering existing tables to a new storage engine takes way too long for me on tables with over a million rows, so to make it faster you can do: Hope this helps anyone who ended up here. |
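For reference, a minimal sketch of the wrapper-mode setup described above, following the linked Mroonga tutorial (the table and column names are placeholders, not anything from this issue):

```sql
-- Wrapper mode: Mroonga handles the fulltext index, while InnoDB
-- stores the rows and provides transactions.
CREATE TABLE entries (
  id INT PRIMARY KEY AUTO_INCREMENT,
  value TEXT,
  FULLTEXT INDEX (value) COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"'
) ENGINE = Mroonga COMMENT = 'engine "InnoDB"';

-- Query with an ordinary boolean-mode fulltext search:
SELECT * FROM entries
WHERE MATCH (value) AGAINST ('192.168' IN BOOLEAN MODE);
```

The `COMMENT = 'engine "InnoDB"'` clause is what selects wrapper mode; without it, Mroonga runs in storage mode and manages the data itself.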
| Comment by Jozef Kováč [ 2023-06-07 ] |
|
Any news on when or whether the ngram parser will be implemented? |
| Comment by Anna Trees [ 2024-01-31 ] |
|
As more and more hosting providers switch from MySQL to MariaDB, this function is becoming increasingly important. I can't understand why this hasn't been implemented for years. MySQL is also open source; the code from there could be adopted, even if it doesn't work for all character sets. At least the applications that have worked well with MySQL so far could continue to run. I am sure I speak for many developers who have now been more or less involuntarily confronted with the switch from MySQL to MariaDB. |