Details

    Description

      This is something added in MySQL 5.7, so perhaps it should be included in MariaDB.

      "ngram and MeCab full-text parser plugins. As of MySQL 5.7.6, MySQL provides a built-in full-text ngram parser plugin that supports Chinese, Japanese, and Korean (CJK).

      For more information, see Section 13.9.8, “ngram Full-Text Parser”."
      http://dev.mysql.com/doc/refman/5.7/en/mysql-nutshell.html

      And the link referenced above is:
      http://dev.mysql.com/doc/refman/5.7/en/fulltext-search-ngram.html
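
      For reference, a minimal sketch of how the parser is declared and queried in MySQL 5.7 (table and column names are made up for illustration; the n-gram size comes from the ngram_token_size startup option, default 2):

        -- Declare a FULLTEXT index that uses the built-in ngram parser.
        CREATE TABLE articles (
          id INT AUTO_INCREMENT PRIMARY KEY,
          body TEXT,
          FULLTEXT INDEX ft_body (body) WITH PARSER ngram
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

        INSERT INTO articles (body) VALUES ('数据库管理系统'), ('全文検索エンジン');

        -- Queries use the ordinary MATCH ... AGAINST syntax.
        SELECT id FROM articles
         WHERE MATCH(body) AGAINST ('数据库' IN BOOLEAN MODE);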

      The MeCab task previously described here is now listed separately as MDEV-22987.

      Attachments

        Issue Links

          Activity

            hzhzh Zhenghao He added a comment -

            It's been a year since this issue was reported, and nothing has been done? China has 300 million Internet users and this n-gram full-text indexing functionality is crucial. Because of this issue I have to switch back to MySQL. Somebody please do something.

            gavin3180 Gavin Chen added a comment -

            Still not added?

            zavndw Andrew Z added a comment -

            No, I tried a week ago.

            Jeff_mai Jeff Xue added a comment -

            Today I tried to use ngram in MariaDB 10.2 and found this issue. I always thought that MariaDB was more versatile than MySQL; I wonder why this plugin hasn't been made available. Thanks!

            awaiscb Awais Fiaz added a comment - - edited

            Nothing on this so far; I just checked today.

            jccsxx jccsxx added a comment -

            I strongly recommend adding this feature, ngram. I need it badly; otherwise I can only switch back to MySQL.

            jccsxx jccsxx added a comment -

            ngram is really necessary.

            freezboi LeeChihoon added a comment -

            When will it be solved?

            greenman Ian Gilfillan added a comment -

            Implementing ngram will be getting some attention soon. MeCab is now separated from this issue, so if MeCab fulltext searching is of interest, rather than just ngram, please vote on that issue, MDEV-22987


            rinat.ibragimov Rinat Ibragimov (Inactive) added a comment -

            There is also one feature from the MySQL version of the ngram plugin that is not present in the current version of the patchset in development. It expands the meaning of stopwords so that every substring of an n-gram is tested against the stopword list. So if the stopword list contains the word "a", the string "branch", which is normally parsed into the list of 3-grams ("bra", "ran", "anc", "nch"), will be parsed into the list ("nch"), since the other three 3-grams contain "a". Is this feature needed? Can it be useful?
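
            For illustration, roughly how this behavior can be observed on a MySQL 5.7 server started with ngram_token_size=3 (database and table names below are made up):

              -- The stopword table must exist before the FULLTEXT index is created.
              CREATE TABLE test.my_stopwords (value VARCHAR(30)) ENGINE=InnoDB;
              INSERT INTO test.my_stopwords VALUES ('a');
              SET GLOBAL innodb_ft_user_stopword_table = 'test/my_stopwords';

              CREATE TABLE test.t (c TEXT, FULLTEXT INDEX (c) WITH PARSER ngram) ENGINE=InnoDB;
              INSERT INTO test.t VALUES ('branch');

              -- Inspect the tokens that actually went into the index cache.
              SET GLOBAL innodb_ft_aux_table = 'test/t';
              SELECT word FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE;
              -- Should return only 'nch'; 'bra', 'ran' and 'anc' are dropped
              -- because each of them contains the stopword 'a'.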

            The patch from MySQL hardcodes a plugin-name check into the InnoDB code, so that InnoDB's behavior changes when the "ngram" plugin is used for text parsing. They do that mostly because the InnoDB code checks the lengths of tokens coming from plugins, even though the plugin API explicitly states that it is the plugin's responsibility to filter out words that are too short or too long. I think I've managed to avoid such explicit name hardcoding by removing the length checks for the case when plugins are used for parsing. However, that made it difficult to replicate the stopword behavior from the original patch, as this behavior is only needed for the "ngram" plugin and not for any other fulltext parsers.

            So I'd like to ask whether this feature is needed at all. I can't figure out how it might be used other than for filtering out some unwanted sequences of characters. Again, why? Censoring of some sort, perhaps?


            marko Marko Mäkelä added a comment -

            Thank you! I think that there are some problems with the MySQL 5.7 code that you have been adapting.

            It seems to me that the currently submitted n-gram parser would not work with UCS2 or UTF-16, even though I would expect those encodings to be used with Chinese, Japanese or Korean, where I would expect the n-gram search to be most useful.

            It also seems to me that the n-gram size should actually be part of the FULLTEXT INDEX metadata and not a global parameter. Changing that parameter could essentially corrupt all affected indexes or lead to wrong search results. The HA_IOPTION interface for index parameters would seem to be appropriate here.
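
            To illustrate the idea with purely hypothetical, not implemented syntax, an HA_IOPTION-based attribute would allow something like a per-index NGRAM_TOKEN_SIZE instead of a global parameter:

              -- Hypothetical syntax only; NGRAM_TOKEN_SIZE is a made-up per-index attribute.
              CREATE TABLE t (
                c TEXT,
                FULLTEXT INDEX ft_c (c) WITH PARSER ngram NGRAM_TOKEN_SIZE=3
              ) ENGINE=InnoDB;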

            I think that we should also review and test compatibility with tables that were created before the introduction of this feature.

            Even though MariaDB Server 10.5.4 was already released as GA, I think that this enhancement could be included in the 10.5 series. The compatibility needs to be ensured even if we postpone this to a later major release.

            greenman Ian Gilfillan added a comment -

            marko it seems like changing this to be part of the fulltext index metadata would be quite a lot of extra development.

            It looks like there are three ways forward:

            1) Limit ngram size to 2 only. If needed, there can be other parsers, 3-gram, 4-gram etc.

            2) Allow an ngram token size from 2 to 10. The corruption you mention is not real corruption, I believe, but simply that the search results would not match. Already, when various fulltext variables are changed (innodb_ft_min_token_size, innodb_ft_max_token_size, innodb_ft_server_stopword_table, innodb_ft_user_stopword_table or innodb_ft_enable_stopword), the indexes need to be rebuilt, and I believe in the case of innodb_ft_min_token_size and innodb_ft_max_token_size the server also needs to be restarted. This then becomes a documentation issue, letting users know that changing the setting requires a rebuild/restart (this currently isn't documented for the other variables either; a rebuild sketch follows below).

            3) Build token size into fulltext index metadata. Most flexibility, but extensive work to implement.

            Given the estimates to complete 3), are 1) or 2) viable?
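
            For reference, the rebuild that option 2) implies would look roughly like this (table and index names are just examples):

              -- innodb_ft_min_token_size / innodb_ft_max_token_size are not dynamic, so they
              -- are changed in the configuration file and the server is restarted; after that
              -- the affected FULLTEXT indexes have to be rebuilt, for example:
              ALTER TABLE articles DROP INDEX ft_body, ADD FULLTEXT INDEX ft_body (body);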


            marko Marko Mäkelä added a comment -

            greenman, thank you for pointing out the existing design problem with the various innodb_ft_ global variables. It could actually be acceptable to continue on the same path, to avoid the extra development at this stage. I think that we should try to allow the ngram token size to be changed freely between 2 and 10. Preferably it should be done without any server restart (maybe with a plugin uninstall/install if necessary) so that we can reasonably write a test that changes the parameter between writes and reads. (In my opinion, it is unreasonable to write a regression test that restarts the entire server many times, because a single restart can easily take 1 or 2 seconds.) We must ensure that such index corruption will not lead to any crashes.

            I really hope that we can replace the InnoDB FULLTEXT INDEX implementation in some not too distant major release. https://mariadb.com/resources/blog/initial-impressions-of-innodb-fulltext/ pointed out several design issues that I do not think can be addressed easily.

            netbroom Dan Sherry added a comment -

            For anyone who is eagerly awaiting this feature and ended up here like myself, you can currently have this functionality with the Mroonga storage engine, and you can use Mroonga in "wrapper mode" with InnoDB to preserve functionality like transactions: https://mroonga.org/docs/tutorial/wrapper.html

            Ngram parser is extremely useful and removes the need to add a search engine like Sphinx to my stack. Typical FULLTEXT with searching words just is not cutting it for the data I have that contains symbols, URLs, IP address strings, etc.

            You can use this to create a FULLTEXT index that behaves like an ngram parser on Mroonga tables: ALTER TABLE table ADD FULLTEXT (value) COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

            And then to query you use a typical FULLTEXT search in boolean mode: MATCH (value) AGAINST (? IN BOOLEAN MODE)
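
            Putting the pieces together, a minimal sketch of a wrapper-mode table with an ngram-like FULLTEXT index (table and column names are just examples):

              -- Wrapper mode: Mroonga provides the fulltext index, InnoDB stores the data.
              CREATE TABLE entries (
                id INT AUTO_INCREMENT PRIMARY KEY,
                value TEXT,
                FULLTEXT INDEX ft_value (value) COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"'
              ) ENGINE=Mroonga COMMENT='engine "InnoDB"' DEFAULT CHARSET=utf8mb4;

              SELECT * FROM entries
               WHERE MATCH(value) AGAINST ('10.0.0.1' IN BOOLEAN MODE);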

            Altering existing tables to a new storage engine takes way too long for me on tables with over a million rows, so to make it faster you can do the following (a SQL sketch follows at the end of this comment):
            1. set session alter_algorithm='INPLACE';
            2. set foreign_key_checks=0;
            3. create table2 (without indexes) with Mroonga (ENGINE=Mroonga COMMENT='engine "InnoDB"')
            4. insert table1 data into table2
            5. add indexes, including the new FULLTEXT
            6. drop all foreign keys that rely on your table
            7. rename table table1 to table1_old;
            8. rename table table2 to table1;
            9. re-add foreign keys
            10. set foreign_key_checks=1;

            Hope this helps for anyone who ended up here.
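
            A rough SQL sketch of the steps above (all table, column and constraint names are placeholders):

              SET SESSION alter_algorithm = 'INPLACE';
              SET foreign_key_checks = 0;

              -- New wrapper-mode table, without secondary indexes yet.
              CREATE TABLE table2 (
                id INT AUTO_INCREMENT PRIMARY KEY,
                value TEXT
              ) ENGINE=Mroonga COMMENT='engine "InnoDB"';

              INSERT INTO table2 SELECT * FROM table1;

              -- Add indexes, including the ngram-like FULLTEXT.
              ALTER TABLE table2
                ADD FULLTEXT INDEX ft_value (value) COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

              -- Drop foreign keys that reference table1, swap the tables, then re-add them.
              ALTER TABLE child_table DROP FOREIGN KEY fk_child_table1;
              RENAME TABLE table1 TO table1_old, table2 TO table1;
              ALTER TABLE child_table
                ADD CONSTRAINT fk_child_table1 FOREIGN KEY (table1_id) REFERENCES table1 (id);

              SET foreign_key_checks = 1;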

            lathander Jozef Kováč added a comment -

            Any news when or if ngram parser will be implemented?

            AnnaTrees Anna Trees added a comment -

            As more and more hosting providers switch from MySQL to MariaDB, this function is becoming increasingly important. I can't understand why this hasn't been implemented for years. MySQL is also open source, so the code from there could be adopted, even if it doesn't work for all character sets. At least the applications that have worked well with MySQL so far could continue to run. I am sure I speak for many developers who have now been more or less involuntarily confronted with the switch from MySQL to MariaDB.

            netbroom Dan Sherry added a comment -

            Is there any sort of timeline on this? It's very important for searching within data that doesn't align with traditional FULLTEXT, like URLs and strings containing symbols. Ngram is pretty much the only performant way to do this with larger datasets (millions of rows).

            For example this is what GitHub uses with Code Search: https://github.blog/engineering/the-technology-behind-githubs-new-code-search/

            You can use Mroonga but it's not very stable, has trouble with larger tables, can corrupt tables if it crashes, and support and documentation are spotty, so a native ngram parser would be best.

            I noticed a commit from 5 years ago that is awaiting review. Any sort of update on progress for this feature would be greatly appreciated.


            People

              serg Sergei Golubchik
              ccalender Chris Calender (Inactive)
              Votes:
              19
              Watchers:
              28

              Dates

                Created:
                Updated:
