[MDEV-24211] FULLTEXT search does not honor case sensitive _bin collations Created: 2020-11-14  Updated: 2023-04-27

Status: Stalled
Project: MariaDB Server
Component/s: Character Sets, Full-text Search
Affects Version/s: 10.0, 10.1, 10.3.25, 10.4.14, 10.5.5, 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.4, 10.5

Type: Bug Priority: Major
Reporter: Lau Assignee: Thirunarayanan Balathandayuthapani
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-9232 Fulltext index on a given column caus... Confirmed

 Description   

This issue seems to exist in all versions of MariaDB and all Versions of MySQL.

Run the following SQL script:

drop database if exists `MatchQueryTest`;
create database `MatchQueryTest`;
use `MatchQueryTest`;
 
CREATE TABLE `Herb` (
    `Id` int NOT NULL AUTO_INCREMENT,
    `Name` varchar(255) character set latin1 collate latin1_bin NULL,
    CONSTRAINT `PK_Herb` PRIMARY KEY (`Id`)
);
 
INSERT INTO `Herb` (`Id`, `Name`)
VALUES (1, 'First Herb Name 1');
 
CREATE FULLTEXT INDEX `IX_Herb_Name` ON `Herb` (`Name`);
 
SELECT COUNT(*)
FROM `Herb` AS `h`
WHERE MATCH (`h`.`Name`) AGAINST ('First*' IN BOOLEAN MODE);

I expect the last SELECT COUNT(*) query to return 1, but instead it returns 0.

The query works, when the collation is changed from latin1_bin to latin1_general_cs. However, since there are no utf8mb4_general_cs (or utf8_general_cs) or similar collations, this will definitely not work with the UTF-8 character set, which is the defacto standard nowadays.



 Comments   
Comment by Alice Sherepa [ 2020-11-16 ]

Thanks for the report! Repeatable on 10.0-10.5 with InnoDB.
Somehow it also depends on the order of adding ft index - before or after insert. + capital letter + collation.

MariaDB [test]> create table t1 (n varchar(255) character set latin1 collate latin1_bin )engine=innodb;
Query OK, 0 rows affected (0.049 sec)
 
MariaDB [test]> create fulltext index indx on t1 (n); 
Query OK, 0 rows affected (0.299 sec)
Records: 0  Duplicates: 0  Warnings: 0
 
MariaDB [test]> insert into t1 values ('Kirst');
Query OK, 1 row affected (0.006 sec)
 
MariaDB [test]> select * from t1 where match(n) against ("Kirst*" in boolean mode);
+-------+
| n     |
+-------+
| Kirst |
+-------+
1 row in set (0.003 sec)
 
MariaDB [test]> create table t2 (n varchar(255) character set latin1 collate latin1_bin )engine=innodb;
Query OK, 0 rows affected (0.066 sec)
 
MariaDB [test]> insert into t2 values ('Kirst');
Query OK, 1 row affected (0.017 sec)
 
MariaDB [test]> create fulltext index indx on t2 (n); 
Query OK, 0 rows affected (0.451 sec)
Records: 0  Duplicates: 0  Warnings: 0
 
MariaDB [test]> select * from t2 where match(n) against ("Kirst*" in boolean mode);
Empty set (0.008 sec)

--source include/have_innodb.inc
 
create table t1 (n varchar(255) character set latin1 collate latin1_bin )engine=innodb;
create fulltext index indx on t1 (n); 
insert into t1 values ('Kirst');
 
select * from t1 where match(n) against ("Kirst*" in boolean mode);
 
create table t2 (n varchar(255) character set latin1 collate latin1_bin )engine=innodb;
insert into t2 values ('Kirst');
create fulltext index indx on t2 (n); 
 
select * from t2 where match(n) against ("Kirst*" in boolean mode);

Comment by Thirunarayanan Balathandayuthapani [ 2020-12-15 ]

During alter table rebuild, InnoDB does convert all token to lower case.

In row_merge_fts_doc_tokenize():
 
                t_str.f_len = innobase_fts_casedn_str(
                        doc->charset, (char*) str.f_str, str.f_len,
                        (char*) &str_buf, FTS_MAX_WORD_LEN + 1);
                
                t_str.f_str = (byte*) &str_buf;

InnoDB should avoid the conversion of lower case during alter. Is there any way to find binary collation from charset ?

Generated at Thu Feb 08 09:28:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.