[MDEV-4850] MATCH uses a wrong fulltext index with mismatching collation Created: 2013-08-07  Updated: 2016-02-02

Status: Open
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.32, 5.1.67, 5.2.14, 5.3.12
Fix Version/s: 5.5

Type: Bug Priority: Minor
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Unresolved Votes: 0
Labels: None


 Description   

This script:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (s1 VARCHAR (60) CHARACTER SET UTF8 COLLATE UTF8_UNICODE_CI) ENGINE = MyISAM;
CREATE FULLTEXT INDEX i ON t1 (s1);INSERT INTO t1 VALUES ('a'),('b'),('c'),('d'),('ÓÓÓÓ'),('OOOO'),(NULL),('ÓÓÓÓ ÓÓÓÓ'),('OOOOOOOO'); 
SELECT * FROM t1 WHERE MATCH(s1) AGAINST ('OOOO' COLLATE UTF8_POLISH_CI);

returns wrong results:

+-------------------+
| s1                |
+-------------------+
| ÓÓÓÓ              |
| OOOO              |
| ÓÓÓÓ ÓÓÓÓ         |
+-------------------+

The problem is that in the collation utf8_polish_ci
the letter "O WITH ACUTE" is not equal to "O":

MariaDB [test]> SELECT 'O'='Ó' COLLATE utf8_polish_ci;
+---------------------------------+
| 'O'='Ó' COLLATE utf8_polish_ci  |
+---------------------------------+
|                               0 |
+---------------------------------+

Therefore, returning the records containing accented letter
in the above query is wrong.

Further investigation with EXPLAIN:

MySQL [test]> EXPLAIN SELECT * FROM t1 WHERE MATCH(s1) AGAINST ('OOOO' COLLATE UTF8_POLISH_CI);
+----+-------------+-------+----------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type     | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+----------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | t1    | fulltext | i             | i    | 0       | NULL |    1 | Using where |
+----+-------------+-------+----------+---------------+------+---------+------+------+-------------+

tells that MATCH erroneously chooses the non-suitable index "i"
whose collation is NOT equal to the collation of the operation.

If I further drop the index and use BOOLEAN MODE:

ALTER TABLE t1 DROP INDEX i;
SELECT * FROM t1 WHERE MATCH(s1) AGAINST ('OOOO' COLLATE UTF8_POLISH_CI IN BOOLEAN MODE);

it still returns wrong results:

+-------------------+
| s1                |
+-------------------+
| ÓÓÓÓ              |
| OOOO              |
| ÓÓÓÓ ÓÓÓÓ         |
+-------------------+

It seems MATCH always uses the collation of the field and does not follow the
standard argument collation aggregation rules.


Generated at Thu Feb 08 06:59:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.