[MDEV-20797] FULLTEXT search with apostrophe, and mandatory words - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.4.8, 10.0(EOL), 10.1(EOL), 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5
Fix Version/s: 10.3.36, 10.4.26, 10.5.17, 10.6.9, 10.7.5
Component/s: Full-text Search, Storage Engine - InnoDB
Labels:
None
Environment:
Fedora 30

Description

Aria and MyISAM appear to work, but not InnoDB:

CREATE TABLE customer (

    name TINYTEXT NOT NULL,

    FULLTEXT (name)

) ENGINE = InnoDB;

INSERT INTO customer VALUES ('O''Brien');

SELECT * FROM customer WHERE MATCH (name) AGAINST ("+O'Brien" IN BOOLEAN MODE);

Empty set (0.000 sec)

CREATE TABLE customer2 (

    name TINYTEXT NOT NULL,

    FULLTEXT (name)

) ENGINE = Aria;

INSERT INTO customer2 VALUES ('O''Brien');

SELECT * FROM customer2 WHERE MATCH (name) AGAINST ("+O'Brien" IN BOOLEAN MODE);

1 row in set (0.001 sec)

CREATE TABLE customer3 (

    name TINYTEXT NOT NULL,

    FULLTEXT (name)

) ENGINE = MyISAM;

INSERT INTO customer3 VALUES ('O''Brien');

SELECT * FROM customer3 WHERE MATCH (name) AGAINST ("+O'Brien" IN BOOLEAN MODE);

1 row in set (0.001 sec)

Problem first reported on StackExchange:
https://dba.stackexchange.com/questions/250524/mariadb-fulltext-search-with-apostrophe-and-mandatory-words

Attachments

Activity

Ascending order - Click to sort in descending order

Alice Sherepa added a comment - 2019-10-10 15:58

Thanks a lot! Reproduced as described on 10.0-10.4:

--source include/have_innodb.inc

CREATE TABLE t1 (name TEXT, FULLTEXT (name)) ENGINE = InnoDB;

INSERT INTO t1 VALUES ('O''Brien');

SELECT * FROM t1 WHERE MATCH (name) AGAINST ("+O'Brien" IN BOOLEAN MODE);

ALTER TABLE t1 ENGINE=MyISAM;

SELECT * FROM t1 WHERE MATCH (name) AGAINST ("+O'Brien" IN BOOLEAN MODE);

DROP TABLE t1;

Alice Sherepa added a comment - 2019-10-10 15:58 Thanks a lot! Reproduced as described on 10.0-10.4: --source include/have_innodb.inc CREATE TABLE t1 ( name TEXT, FULLTEXT ( name )) ENGINE = InnoDB; INSERT INTO t1 VALUES ( 'O' 'Brien' ); SELECT * FROM t1 WHERE MATCH ( name ) AGAINST ( "+O'Brien" IN BOOLEAN MODE); ALTER TABLE t1 ENGINE=MyISAM; SELECT * FROM t1 WHERE MATCH ( name ) AGAINST ( "+O'Brien" IN BOOLEAN MODE); DROP TABLE t1;

Rick James (Inactive) added a comment - 2019-10-15 03:32

Please reverify with the + outside the "", as in ... AGAINST (+"O'Brien" ...

Please reverify with less than 50% of the table containing the desired word.

Rick James (Inactive) added a comment - 2019-10-15 03:32 Please reverify with the + outside the "", as in ... AGAINST (+"O'Brien" ... Please reverify with less than 50% of the table containing the desired word.

Craig Francis added a comment - 2019-10-15 11:36

@rjasdf, moving the "+" outside of the quoted string seems to make the query behave as though the + wasn't there, and in the original example this was for a word that was used less than 50%.

I've just updated the original Stack Exchange question to provide a better example:

    CREATE TABLE customer (

        name TINYTEXT NOT NULL,

        FULLTEXT (name)

    ) ENGINE = InnoDB;

    INSERT INTO customer VALUES

        ('O''Brien'),

        ('O Brien'),

        ('X''Brien'),

        ('Extra Amy'),

        ('Extra Brian'),

        ('Extra Cat'),

        ('Extra Debbie');

    SELECT * FROM customer WHERE MATCH (name) AGAINST (+"O'Brien" IN BOOLEAN MODE);

    "O'Brien"

    "O Brien"

    "X'Brien"

Craig Francis added a comment - 2019-10-15 11:36 @rjasdf, moving the "+" outside of the quoted string seems to make the query behave as though the + wasn't there, and in the original example this was for a word that was used less than 50%. I've just updated the original Stack Exchange question to provide a better example: CREATE TABLE customer ( name TINYTEXT NOT NULL , FULLTEXT ( name ) ) ENGINE = InnoDB; INSERT INTO customer VALUES ( 'O' 'Brien' ), ( 'O Brien' ), ( 'X' 'Brien' ), ( 'Extra Amy' ), ( 'Extra Brian' ), ( 'Extra Cat' ), ( 'Extra Debbie' ); SELECT * FROM customer WHERE MATCH ( name ) AGAINST (+ "O'Brien" IN BOOLEAN MODE); "O'Brien" "O Brien" "X'Brien"

Yannik H added a comment - 2020-09-01 00:34

I got bitten by this today and I noticed this bug has been open for a while, so I took the time to do some investigation. I'm not at all familiar with the MariaDB codebase, so apologies in advance for my childlike innocence.

Near as I can tell from looking at the source code, O'Brien is tokenised as O and Brien, which means that "+O'Brien" wouldn't ever match, but "O'Brien" would because both the column index and the SELECT query are being tokenised the same way so that Brien matches Brien. In "+O'Brien" it's the "O" part that's being forced, but being so short it will never match, so you get zero rows (but "+Brien" does return rows containing O'Brien).

However even the latest MySQL docs refer to the ability to include single apostrophes in words, so I've undertaken some archaeology to work out where this went wrong, and I think it's here: https://bugs.mysql.com/bug.php?id=14194 (and its commit https://lists.mysql.com/commits/1056)

The decision to not consider an apostrophe to be part of a word was intentional then, and for what sounds like a good reason in Romance languages. The MySQL docs are just out of date. But that decision means that we can't search for O'Brien as a word in itself.

I'm not sure how to fix this easily so that we can have the best of both worlds. I suppose ideally O'Brien would be tokenised as two words (O'Brien and Brien) to work for both use cases, but that may be a difficult change with the way the code currently works. We could also special-case O'xxx, which would work for Irish patronymics without breaking common nouns in Romance languages (as far as I know). This would be an easy change to innobase_mysql_fts_get_token (and friends), and it would solve all the examples I can think of, but there may be a language I'm not aware of which contracts common nouns using O.

Alternately, instead of changing the way that words get tokenised, we could change the parsing of the fulltext query so that "+O'Brien" gets interpreted as "+Brien" instead of "+O" as it currently does. This would involve changing fts_get_word (and friends) to ignore a single true_word_char followed by an apostrophe, so that it would strip the "O'" and see the "Brien" part instead when parsing the fulltext query, but not when tokenising words for the index itself. This would also be simple to change, but it would mean that a search for "+O'Brien" would also return rows with Brien (without a O'). Still, this is much better than returning no rows at all, which is the current result.

So there's no perfect solution, but there are some simple ways to improve the situation. I'm happy to submit a patch for one of these options if someone points me in the right direction, but seeing as I'm not familiar with MariaDB's code base at all, it may be better if someone else makes a decision here and writes the change (and verifies that I'm actually correct about all this).

Yannik H added a comment - 2020-09-01 00:34 I got bitten by this today and I noticed this bug has been open for a while, so I took the time to do some investigation. I'm not at all familiar with the MariaDB codebase, so apologies in advance for my childlike innocence. Near as I can tell from looking at the source code, O'Brien is tokenised as O and Brien, which means that "+O'Brien" wouldn't ever match, but "O'Brien" would because both the column index and the SELECT query are being tokenised the same way so that Brien matches Brien. In "+O'Brien" it's the "O" part that's being forced, but being so short it will never match, so you get zero rows (but "+Brien" does return rows containing O'Brien). However even the latest MySQL docs refer to the ability to include single apostrophes in words, so I've undertaken some archaeology to work out where this went wrong, and I think it's here: https://bugs.mysql.com/bug.php?id=14194 (and its commit https://lists.mysql.com/commits/1056 ) The decision to not consider an apostrophe to be part of a word was intentional then, and for what sounds like a good reason in Romance languages. The MySQL docs are just out of date. But that decision means that we can't search for O'Brien as a word in itself. I'm not sure how to fix this easily so that we can have the best of both worlds. I suppose ideally O'Brien would be tokenised as two words (O'Brien and Brien) to work for both use cases, but that may be a difficult change with the way the code currently works. We could also special-case O'xxx, which would work for Irish patronymics without breaking common nouns in Romance languages (as far as I know). This would be an easy change to innobase_mysql_fts_get_token (and friends), and it would solve all the examples I can think of, but there may be a language I'm not aware of which contracts common nouns using O. Alternately, instead of changing the way that words get tokenised, we could change the parsing of the fulltext query so that "+O'Brien" gets interpreted as "+Brien" instead of "+O" as it currently does. This would involve changing fts_get_word (and friends) to ignore a single true_word_char followed by an apostrophe, so that it would strip the "O'" and see the "Brien" part instead when parsing the fulltext query, but not when tokenising words for the index itself. This would also be simple to change, but it would mean that a search for "+O'Brien" would also return rows with Brien (without a O'). Still, this is much better than returning no rows at all, which is the current result. So there's no perfect solution, but there are some simple ways to improve the situation. I'm happy to submit a patch for one of these options if someone points me in the right direction, but seeing as I'm not familiar with MariaDB's code base at all, it may be better if someone else makes a decision here and writes the change (and verifies that I'm actually correct about all this).

Marko Mäkelä added a comment - 2020-12-15 13:37

yannik, thank you for your comment. I wonder if ’ would be treated differently from '. Of course, the column would have to be in UTF-8 collation, and you would have to ensure that the client character set is correct (that would be SET NAMES utf8 in our test files).

Changing the way how things are indexed constitutes a change of file format. In the early days of MySQL (probably before bugs.mysql.com was created and a little before my time, which started in 2003), I remember that someone got the idea of changing some collation tables. That would make existing indexes look corrupted. That mistake was not repeated ever again, I hope.

While the change of tokenization would change the format of the inverted indexes, it might be acceptable. We already faced a similar question in MDEV-10267 (introducing n-gram support): the parameter N is apparently not part of the data dictionary, but a global configuration parameter that affects the way how all data is being interpreted. It might be acceptable to break the format. We merely have to test what happens with each possible tokenization and query. The server must not crash, and it must report an acceptable result, both after upgrading to the fixed version and downgrading to the not-yet-fixed version.

Because you have already dug so deep, I would invite you to follow the steps in https://mariadb.org/contribute/.

Marko Mäkelä added a comment - 2020-12-15 13:37 yannik , thank you for your comment. I wonder if ’ would be treated differently from '. Of course, the column would have to be in UTF-8 collation, and you would have to ensure that the client character set is correct (that would be SET NAMES utf8 in our test files). Changing the way how things are indexed constitutes a change of file format. In the early days of MySQL (probably before bugs.mysql.com was created and a little before my time, which started in 2003), I remember that someone got the idea of changing some collation tables. That would make existing indexes look corrupted. That mistake was not repeated ever again, I hope. While the change of tokenization would change the format of the inverted indexes, it might be acceptable. We already faced a similar question in MDEV-10267 (introducing n-gram support): the parameter N is apparently not part of the data dictionary, but a global configuration parameter that affects the way how all data is being interpreted. It might be acceptable to break the format. We merely have to test what happens with each possible tokenization and query. The server must not crash, and it must report an acceptable result, both after upgrading to the fixed version and downgrading to the not-yet-fixed version. Because you have already dug so deep, I would invite you to follow the steps in https://mariadb.org/contribute/ .

Marko Mäkelä added a comment - 2022-07-25 09:03

thiru, I think that some more explanation to the proposed fix is needed, both in the commit message and in test cases.

What will the function innobase_mysql_fts_get_token() be invoked on? Only on the contents of full-text indexed columns, or also on a query string?

Can you extend the test case? Insert also the rows Brien, O 'Brien and O' Brien. Will they be found in the same way as with ENGINE=MyISAM or ENGINE=Aria? What about Doh'nuts?

Marko Mäkelä added a comment - 2022-07-25 09:03 thiru , I think that some more explanation to the proposed fix is needed, both in the commit message and in test cases. What will the function innobase_mysql_fts_get_token() be invoked on? Only on the contents of full-text indexed columns, or also on a query string? Can you extend the test case? Insert also the rows Brien , O 'Brien and O' Brien . Will they be found in the same way as with ENGINE=MyISAM or ENGINE=Aria? What about Doh'nuts ?

Pere added a comment - 2022-08-30 15:20

I think this introduces a regression.

After this change, how can we search for "o'brien" without getting "brien"-only results? I tried using quotes, but that didn't work either.

Pere added a comment - 2022-08-30 15:20 I think this introduces a regression. After this change, how can we search for "o'brien" without getting "brien"-only results? I tried using quotes, but that didn't work either.

People

Assignee:: Thirunarayanan Balathandayuthapani

Reporter:: Karl Levik

Votes:: 2 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 2019-10-10 11:30

Updated:: 2024-07-07 22:33

Resolved:: 2022-07-26 10:20

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server