[MDEV-12871] IN BOOLEAN MODE does not overcome 50% limit as claimed Created: 2017-05-22  Updated: 2017-05-27  Resolved: 2017-05-26

Status: Closed
Project: MariaDB Server
Component/s: Full-text Search, Storage Engine - MyISAM
Affects Version/s: 10.1.22
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: M MacDonald Assignee: Sergei Golubchik
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

FreeBSD 10.3, 64-bit, 2017Q2



 Description   

Experimenting with FULLTEXT, it took me awhile to make a match across columns work at all because the examples in the documentation are incomplete and I didn't know how to set up to use that kind of matching (I'm not even sure what the FULLTEXT keyword's technical role is, since it doesn't appear in the output of DESCRIBE).

Because I need to match keywords that may appear in more than half the records, I was matching IN BOOLEAN MODE. But I was getting "empty set" back. So I tried matching on a rare term and, of course, got a hit back. Clearly, the docs are wrong or (I hope) there's a bug in the IN BOOLEAN MODE qualifier.



 Comments   
Comment by M MacDonald [ 2017-05-23 ]

I'd close this issue, but this isn't bugzilla, so I don't yet know how to do things.

With more experimentation I got the multicolumn match to work (the docs should be fixed), but even matching IN BOOLEAN MODE doesn't overcome the 50% limitation as is claimed in the docs, returning hits only on terms that appear in fewer than half the records.

Comment by Elena Stepanova [ 2017-05-23 ]

If you refer to documentation at mariadb.com/kb, please note that it's a community wiki, everyone with a login can edit it, e.g. add examples or fix errors. Or, if you don't feel like doing it yourself, you can suggest changes here, we'll re-qualify it as documentation.

For the rest, could you please provide an actual example with structures, data, query, and expected vs actual result?

Comment by M MacDonald [ 2017-05-24 ]

Okay, I'll try patching the doc when I get a moment.

Here's an example. "oneway" (appearing in 1 of 5 records) is found, but "name" (in 4 of 5) is not. Behavior with or without the "in boolean mode" qualifier is identical.

My CREATE TABLE statement (not shown here) included FULLTEXT keys for all the tag columns individually, and a separate FULLTEXT key for all the tag columns as a group.

MariaDB [mapdata]> describe tst ;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| Idx   | int(11)      | NO   | MUL | 0       |       |
| Tag00 | varchar(500) | YES  | MUL | NULL    |       |
| Tag01 | varchar(500) | YES  | MUL | NULL    |       |
| Tag02 | varchar(500) | YES  | MUL | NULL    |       |
| Tag03 | varchar(500) | YES  | MUL | NULL    |       |
| Tag04 | varchar(500) | YES  | MUL | NULL    |       |
| Tag05 | varchar(500) | YES  | MUL | NULL    |       |
| Tag06 | varchar(500) | YES  | MUL | NULL    |       |
| Tag07 | varchar(500) | YES  | MUL | NULL    |       |
| Tag08 | varchar(500) | YES  | MUL | NULL    |       |
| Tag09 | varchar(500) | YES  | MUL | NULL    |       |
| Tag10 | varchar(500) | YES  | MUL | NULL    |       |
| Tag11 | varchar(500) | YES  | MUL | NULL    |       |
+-------+--------------+------+-----+---------+-------+
13 rows in set (0.02 sec)
 
MariaDB [mapdata]> select * from tst \G
*************************** 1. row ***************************
  Idx: 2044
Tag00: name => Park Boulevard
Tag01: highway => tertiary
Tag02: maxspeed => 35 mph
Tag03: NULL
Tag04: NULL
Tag05: NULL
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL
*************************** 2. row ***************************
  Idx: 379
Tag00: name => Jurgella Lane
Tag01: highway => residential
Tag02: NULL
Tag03: NULL
Tag04: NULL
Tag05: NULL
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL
*************************** 3. row ***************************
  Idx: 1856
Tag00: name => Cottonwood Springs Road
Tag01: is_in => Riverside,CA
Tag02: highway => residential
Tag03: NULL
Tag04: NULL
Tag05: NULL
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL
*************************** 4. row ***************************
  Idx: 1855
Tag00: highway => path
Tag01: NULL
Tag02: NULL
Tag03: NULL
Tag04: NULL
Tag05: NULL
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL
*************************** 5. row ***************************
  Idx: 889
Tag00: hgv => designated
Tag01: ref => US 51 Business
Tag02: name => Division Street
Tag03: oneway => yes
Tag04: highway => primary
Tag05: maxspeed => 35 mph
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL
5 rows in set (0.00 sec)
 
MariaDB [mapdata]> select * from tst where match(tag00,tag01,tag02,tag03,tag04,
tag05,tag06,tag07,tag08,tag09,tag10,tag11) against ('name' in boolean mode) ;
Empty set (0.02 sec)
 
MariaDB [mapdata]> select * from tst where match(tag00,tag01,tag02,tag03,tag04,
tag05,tag06,tag07,tag08,tag09,tag10,tag11) against ('oneway' in boolean mode) ;
+-----+-------------------+-----------------------+-------------------------+---------------+--------------------+--------------------+-------+-------+-------+-------+-------+-------+
| Idx | Tag00             | Tag01                 | Tag02                   | Tag03         | Tag04              | Tag05              | Tag06 | Tag07 | Tag08 | Tag09 | Tag10 | Tag11 |
+-----+-------------------+-----------------------+-------------------------+---------------+--------------------+--------------------+-------+-------+-------+-------+-------+-------+
| 889 | hgv => designated | ref => US 51 Business | name => Division Street | oneway => yes | highway => primary | maxspeed => 35 mph | NULL  | NULL  | NULL  | NULL  | NULL  | NULL  |
+-----+-------------------+-----------------------+-------------------------+---------------+--------------------+--------------------+-------+-------+-------+-------+-------+-------+
1 row in set (0.00 sec)
 
MariaDB [mapdata]>

Comment by M MacDonald [ 2017-05-24 ]

oops. I've no idea where that big gap came from, nor how to fix it. I'll try putting in that part of the text again:

MariaDB [mapdata]> select * from tst \G

Idx: 2044
Tag00: name => Park Boulevard
Tag01: highway => tertiary
Tag02: maxspeed => 35 mph
Tag03: NULL
Tag04: NULL
Tag05: NULL
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL

Idx: 379
Tag00: name => Jurgella Lane
Tag01: highway => residential
Tag02: NULL
Tag03: NULL
Tag04: NULL
Tag05: NULL
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL

Idx: 1856
Tag00: name => Cottonwood Springs Road
Tag01: is_in => Riverside,CA
Tag02: highway => residential
Tag03: NULL
Tag04: NULL
Tag05: NULL
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL

Idx: 1855
Tag00: highway => path
Tag01: NULL
Tag02: NULL
Tag03: NULL
Tag04: NULL
Tag05: NULL
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL

Idx: 889
Tag00: hgv => designated
Tag01: ref => US 51 Business
Tag02: name => Division Street
Tag03: oneway => yes
Tag04: highway => primary
Tag05: maxspeed => 35 mph
Tag06: NULL
Tag07: NULL
Tag08: NULL
Tag09: NULL
Tag10: NULL
Tag11: NULL
5 rows in set (0.00 sec)

Comment by M MacDonald [ 2017-05-24 ]

aha! Apparently the row of stars that mariadb puts into a \G output to separate the records gets jira upset. That's a bug!

Comment by Sergei Golubchik [ 2017-05-26 ]

This is not a bug, and it has nothing to do with 50%. The word "name" would not have been found, even if it was present only in one row. Because it is in the default stopword list: https://mariadb.com/kb/en/mariadb/stopwords/

That page also explains how to modify the stopword list.

Comment by M MacDonald [ 2017-05-27 ]

I apologise for not being more suspicious. It simply never crossed my mind that "name" might be excluded by default from search.

I just now looked at the "stopwords" (odd term) lists for the first time and was dismayed to see the size and makeup of the isam list. I expected it to look like the innodb list: a small number of "glue" words, not such a large number of basic nouns, verbs, adjectives, and adverbs. I wonder who made up that list, and what motivated those choices---they seem quite arbitrary, from here.

Comment by Sergei Golubchik [ 2017-05-27 ]

"stop word" is a standard term in the Information Retrieval, see https://www.google.de/search?q=stopword

When you search in a collection (this is Information Retrieval terminology, in MariaDB it'd be "in a table"), the found documents ("rows") are assigned a "relevance" value. The relevance depends on how many times a particular word was present in a particular row and in how many rows a particular word was present in total. Words that are present in a lot of rows will provide a very small, practically neglectable, increase in relevance. On the other hand, these words significantly increase the index size. That's the same Pareto law, 90% of the words attribute only for 10% of the index size, and the 90% of the index size is filled with 10% of the very popular words. These popular words are useless to search for, because they are present in pretty much every row. That's the logic behind stopwords, by not indexing words with very low semantic value, we simultaneously decrease the index size (10x) and improve search results (by removing noise matches) and search speed too.

Of course, stopword list is not universal. The popular word in one data set, may be a very rare and unique in some other data set.

Our particular list of stop words is a generic English stop word list, taken from SMART Information Retrieval system. We tested it on SMART test document collections and found that it performs very well (improving the index size and relevance of the results).

Generated at Thu Feb 08 08:01:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.