Details

    Description

      This is something added in MySQL 5.7, so perhaps it should be included in MariaDB.

      "ngram and MeCab full-text parser plugins. As of MySQL 5.7.6, MySQL provides a built-in full-text ngram parser plugin that supports Chinese, Japanese, and Korean (CJK).

      For more information, see Section 13.9.8, “ngram Full-Text Parser”
      http://dev.mysql.com/doc/refman/5.7/en/mysql-nutshell.html

      And the link referenced above is:
      http://dev.mysql.com/doc/refman/5.7/en/fulltext-search-ngram.html
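
      For reference, this is roughly how the feature is used on the MySQL side (a minimal sketch against MySQL 5.7+; the table and column names are made up for illustration):

        -- ngram_token_size is a read-only startup option (default 2, range 1-10),
        -- so it has to be set in my.cnf or on the command line, e.g. --ngram_token_size=2
        CREATE TABLE articles (
          id INT AUTO_INCREMENT PRIMARY KEY,
          body TEXT,
          FULLTEXT INDEX ft_body (body) WITH PARSER ngram
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

        SELECT id FROM articles
        WHERE MATCH (body) AGAINST ('数据库' IN BOOLEAN MODE);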

      The MeCab task previously described here is now tracked separately as MDEV-22987.

          Activity

            marko Marko Mäkelä added a comment -

            greenman, thank you for pointing out the existing design problem with the various innodb_ft_ global variables. It could actually be acceptable to continue on the same path, to avoid the extra development at this stage. I think that we should try to allow the ngram token size to be changed freely between 2 and 10. Preferably it should be done without any server restart (maybe with a plugin uninstall/install if necessary), so that we can reasonably write a test that changes the parameter between writes and reads. (In my opinion, it is unreasonable to write a regression test that restarts the entire server many times, because a single restart can easily take 1 or 2 seconds.) We must ensure that such index corruption will not lead to any crashes.

            I really hope that we can replace the InnoDB FULLTEXT INDEX implementation in some not too distant major release. https://mariadb.com/resources/blog/initial-impressions-of-innodb-fulltext/ pointed out several design issues that I do not think can be addressed easily.
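
            To illustrate the design problem mentioned above (a sketch; the innodb_ft_ variable names are the existing ones, the session shown is hypothetical):

              -- The existing token-size options are server-wide, shared by every
              -- InnoDB FULLTEXT index, and read-only at runtime:
              SHOW GLOBAL VARIABLES LIKE 'innodb_ft_%token_size';
              SET GLOBAL innodb_ft_min_token_size = 2;
              -- ERROR 1238 (HY000): Variable 'innodb_ft_min_token_size' is a read only variable
              -- so any change requires a server restart plus a rebuild of each affected index.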

            netbroom Dan Sherry added a comment -

            For anyone who is eagerly awaiting this feature and ended up here like me: you can get this functionality today with the Mroonga storage engine, and you can use Mroonga in "wrapper mode" with InnoDB to preserve functionality like transactions: https://mroonga.org/docs/tutorial/wrapper.html

            The ngram parser is extremely useful and removes the need to add a search engine like Sphinx to my stack. A typical word-based FULLTEXT search just is not cutting it for the data I have, which contains symbols, URLs, IP address strings, etc.

            You can use this to create a FULLTEXT index that behaves like an ngram parser on Mroonga tables (t1 and value are placeholder names): ALTER TABLE t1 ADD FULLTEXT (value) COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

            And then to query you use a typical FULLTEXT search in boolean mode: MATCH (value) AGAINST (? IN BOOLEAN MODE)
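
            Putting those two together (a minimal end-to-end sketch, assuming the Mroonga plugin is installed; the table and column names are illustrative):

              -- wrapper mode: Mroonga provides the FULLTEXT index, InnoDB stores the rows
              CREATE TABLE notes (
                id INT AUTO_INCREMENT PRIMARY KEY,
                value TEXT NOT NULL
              ) ENGINE=Mroonga COMMENT='engine "InnoDB"' DEFAULT CHARSET=utf8mb4;

              ALTER TABLE notes ADD FULLTEXT (value)
                COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

              -- bigram tokenization lets boolean-mode matches work on symbols, URLs, IPs
              SELECT id FROM notes
              WHERE MATCH (value) AGAINST ('10.0.0' IN BOOLEAN MODE);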

            Converting existing tables to a new storage engine takes far too long for me on tables with over a million rows, so to speed it up you can do the following (a SQL sketch of these steps follows the list):
            1. set session alter_algorithm='INPLACE';
            2. set foreign_key_checks=0;
            3. create table2 (without indexes) with Mroonga (ENGINE=Mroonga COMMENT='engine "InnoDB"')
            4. insert table1's data into table2
            5. add indexes, including the new FULLTEXT
            6. drop all foreign keys that reference your table
            7. rename table table1 to table1_old;
            8. rename table table2 to table1;
            9. re-add the foreign keys
            10. set foreign_key_checks=1;
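
            Written out as SQL, the steps above might look like this (a sketch; table, column, and constraint names are placeholders):

              SET SESSION alter_algorithm='INPLACE';
              SET foreign_key_checks=0;

              -- 3. new table in wrapper mode, no secondary indexes yet
              CREATE TABLE table2 (
                id INT AUTO_INCREMENT PRIMARY KEY,
                value TEXT NOT NULL
              ) ENGINE=Mroonga COMMENT='engine "InnoDB"';

              -- 4. copy the data
              INSERT INTO table2 SELECT * FROM table1;

              -- 5. add indexes, including the new FULLTEXT
              ALTER TABLE table2 ADD FULLTEXT (value)
                COMMENT 'tokenizer "TokenBigramSplitSymbolAlphaDigit"';

              -- 6. for each foreign key that references table1:
              ALTER TABLE child_table DROP FOREIGN KEY fk_name;

              -- 7./8. swap the tables
              RENAME TABLE table1 TO table1_old, table2 TO table1;

              -- 9. re-add the foreign keys against the new table1
              ALTER TABLE child_table ADD CONSTRAINT fk_name
                FOREIGN KEY (table1_id) REFERENCES table1 (id);

              SET foreign_key_checks=1;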

            Hope this helps anyone who ended up here.

            lathander Jozef Kováč added a comment -

            Any news on when or if the ngram parser will be implemented?

            AnnaTrees Anna Trees added a comment -

            As more and more hosting providers switch from MySQL to MariaDB, this feature is becoming increasingly important. I can't understand why this hasn't been implemented for years. MySQL is also open source, so the code could be ported from there, even if it doesn't work for all character sets. At least the applications that have worked well with MySQL so far could continue to run. I am sure I speak for many developers who have now been more or less involuntarily confronted with the switch from MySQL to MariaDB.

            netbroom Dan Sherry added a comment -

            Is there any sort of timeline on this? It's very important for searching within data that doesn't align with traditional FULLTEXT, like URLs and strings containing symbols. Ngram is pretty much the only performant way to do this with larger datasets (millions of rows).

            For example, this is what GitHub uses for its Code Search: https://github.blog/engineering/the-technology-behind-githubs-new-code-search/

            You can use Mroonga, but it's not very stable, has trouble with larger tables, can corrupt tables if it crashes, and its support and documentation are spotty, so a native ngram parser would be best.

            I noticed a commit from 5 years ago that is awaiting review. Any sort of update on progress for this feature would be greatly appreciated.


            People

              serg Sergei Golubchik
              ccalender Chris Calender (Inactive)
              Votes: 19
              Watchers: 28
