[MDEV-12910] FULLTEXT: Num cols returned on hit should be switch-selectable Created: 2017-05-24  Updated: 2017-05-30  Resolved: 2017-05-27

Status: Closed
Project: MariaDB Server
Component/s: Full-text Search
Affects Version/s: 10.1.22
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: M MacDonald Assignee: Sergei Golubchik
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

FreeBSD 10.3, 2017Q2



 Description   

Maybe this will be construed as a Task per policy, but it's at least a human-factors Bug, if it now works the way intended.

Currently a hit on even 1 column in a multi-column search returns all columns. That should be a per-call switch (RETURN_ALL_COLS, RETURN_COLS_HIT_ONLY)...or, slightly more succinctly, ON_HIT_RETURN_HIT, ON_HIT_RETURN_ALL_COLS).

Returning only the column(s) hit reduces or eliminates the need for post-hoc filtering when it's not known in advance which column has which search target in a given record and the columns not hit are not important.



 Comments   
Comment by Sergei Golubchik [ 2017-05-25 ]

What do you mean by that? Can you show an example? With CREATE TABLE and a SELECT?

As far as I understand, MATCH…AGAINST does not return columns, it returns a floating point relevance value.

Comment by M MacDonald [ 2017-05-25 ]

See MDEV-12871 illustrating the failure of IN BOOLEAN MODE to overcome the documented 50% limit,

I didn't preserve the text of the CREATE for my test table, but: I created a FULLTEXT key for each of the columns to be collectively searched as well as a FULLTEXT key for all the columns taken together.

I also provided the text of the SELECTs and outcome of searching for a target that appears only 1 time in 5 records, (the search succeeds and returns all columns), and searching for a target that appears in 4 of the 5 records (the search fails and returns an empty set).

That floating point value is returned if you use a flag to request that it show the relevance value it used, but it's not returned unless requested.

Comment by Sergei Golubchik [ 2017-05-26 ]

Yes, but how ON_HIT_RETURN_HIT, ON_HIT_RETURN_ALL_COLS would work? Can you show an example of a query that uses ON_HIT_RETURN_HIT and explain what it should return?

Comment by M MacDonald [ 2017-05-27 ]

Currently, my search on 'name' (with the stoplist turned off) returns what would be the ON_HIT_RETURN_ALL (or just 'RETURN_ALL') case:

MariaDB [mapdata]> select * from tst where match(t00,t01,t02,t03,t04,t05,t06,t07,t08,t09,t10,t11) against ('name' in boolean mode) \G

Idx: 1065
T00: name => Jefferson Street
T01: highway => tertiary
T02: NULL
T03: NULL
T04: NULL
T05: NULL
T06: NULL
T07: NULL
T08: NULL
T09: NULL
T10: NULL
T11: NULL

Idx: 1856
T00: name => Cottonwood Springs Road
T01: is_in => Riverside,CA
T02: highway => residential
T03: NULL
T04: NULL
T05: NULL
T06: NULL
T07: NULL
T08: NULL
T09: NULL
T10: NULL
T11: NULL

Idx: 889
T00: hgv => designated
T01: ref => US 51 Business
T02: name => Division Street
T03: oneway => yes
T04: highway => primary
T05: maxspeed => 35 mph
T06: NULL
T07: NULL
T08: NULL
T09: NULL
T10: NULL
T11: NULL

Idx: 929
T00: name => North Point Drive
T01: highway => tertiary
T02: hgv => designated
T03: NULL
T04: NULL
T05: NULL
T06: NULL
T07: NULL
T08: NULL
T09: NULL
T10: NULL
T11: NULL
4 rows in set (0.00 sec)

MariaDB [mapdata]>

Whereas, with a default of ON_HIT_RETURN_HIT (or just 'RETURN_HITS' ), I would have instead got back only the field(s) with the hit(s):

MariaDB [mapdata]> select * from tst where match(t00,t01,t02,t03,t04,t05,t06,t07,t08,t09,t10,t11) against ('name' in boolean mode) \G

Idx: 1065
T00: name => Jefferson Street

Idx: 1856
T00: name => Cottonwood Springs Road

Idx: 889
T02: name => Division Street

Idx: 929
T00: name => North Point Drive

4 rows in set (0.00 sec)

MariaDB [mapdata]>

(This edit box really does not like those rows of stars!)

Comment by Sergei Golubchik [ 2017-05-27 ]

Sorry, but what you're asking it not possible. MATCH…AGAINST is just a regular function. Nothing more. It returns one number — relevance value. When you write

select * from tst where match(t00,t01,t02,t03,t04,t05,t06,t07,t08,t09,t10,t11) against ('name' in boolean mode);

The WHERE condition implicitly means

      where match(...) against ('name' in boolean mode) != 0;

That is, you select all rows with a non-zero relevance. Not much different from

      where sin(x);

Which selects rows where sin(x) is non-zero.
The first clause — SELECT * — means "all rows", that's SQL Standard, and this interpretation cannot possibly be influenced by whatever function is used in the WHERE clause.

Also, the semantics of your extension is clear only in the most simple case. What if this word is found in different columns in different rows? What if there's more than one MATCH...AGAINST? On different sets of columns? On different tables? In a UNION? In a subquery? in a view? All the above?

(when you edit, there's a small icon under the edit form, it explains Jira markup syntax. In particular, for the fixed-width text you want to use {noformat}....{noformat} tags)

Comment by M MacDonald [ 2017-05-27 ]

Saying it's "not possible" makes it seem as though what gets returned is a function of physical law. But of course it's not. We humans determine what our code returns.

The syntax INSERT INTO ... SET ... is non-standard, but it works and is much easier to get right than VALUES. But somebody determined that the VALUES syntax would be the only official SQL syntax because they didn't understand or perhaps didn't care about human needs.

The C syntax that makes it impossible for the compiler to catch the typo '=' that was meant to be '==' is another case of poor design. That's a very hard error even for a human to catch, and we are wonderful pattern-matchers. Had Dennis Ritchie been more aware of human cognitive limitations, or cared about them, he could have chosen a Fortran-like notation for boolean tests instead: '=' cannot be mistaken for 'EQ', 'NE', etc., so such mistakes are much harder to make and much easier to catch. Offering 'EQ' etc. as an alternate syntax would be an example of Knuth's "syntactic sugar" that would improve both productivity and readability, but C fundamentalists are completely opposed to it because they regard Ritchie's personal choices as near-sacred. (I finally gave up suggesting it and just used the preprocessor to take care of my own code.)

A FULLTEXT (T0,T1) key doesn't serve the same purpose as an INDEX (T0,T1) key, so why should it have the same limitations on returns?

Finding a target in different columns in different rows is shown in my example: most hits are in T00, but one is in T02. That's easier to see without all the non-hits also being returned.

The readability of JOINs, UNIONs, etc. would be improved by not returning unwanted columns. The key understanding here is that non-hits are unwanted. They may be wanted later, but they shouldn't be returned by a SELECT that doesn't ask for them. I'm forced to use SELECT * because I don't know which column my target is in, not because I want all the columns ---that's a way in which FULLTEXT could and should provide very different value.

My example table was taken from Open Street Map. The scratch table I'm working on has over 100M key/value records, a very large number of the keys and values are complete rubbish, and the rest have unpredictable positions in each record (see my T00 vs T02 hits). I won't be using the current implementation of FULLTEXT to do my searches because there's no advantage in it for me as long as I have to do post-processing to find the only fields I care about.

Comment by Sergei Golubchik [ 2017-05-27 ]

I agree, I also hate the word "impossible" in this context.

Still, while I understand the intention of your proposed feature, I don't see how to describe the semantics that consistently works in all cases. And how to explain logically why a function from the WHERE clause affects * field list in the SELECT clause.

If you only want to see matching columns, you can change your table structure and use key/value scheme, like

CREATE TABLE tst ( Idx int, key char(3), value varchar(100));
insert tst values (1065,'t00', 'name => Jefferson Street'), ...

Comment by M MacDonald [ 2017-05-27 ]

I don't see how to describe the semantics that consistently works in all cases

I'm not sure what problem you're envisioning, because when I think about doing it, it's as straightforward as it was for me to knock out those unwanted returns in my example. This isn't a case where a completely different presentation would be required.

Right now, we can identify some subset for SELECTion by enumerating each field. And if we want all the fields, we have the star shorthand. Either syntax gives us exactly what we want, we don't have to accept fields we don't want.

But in the case where we don't know from record to record what field the desired value is in, we need either a different syntax for the SELECT, or we need to prune the return. Not to do it is the equivalent of only having SELECT * as a tool – grab everything and throw away what you don't want. Rather wasteful.

Suppose (I'm not suggesting this, it just allows an example that's easier to think about) we extended the syntax such that

SELECT # [or some other symbol] FROM foo WHERE # LIKE '%name%'

meant

SELECT any and all fields FROM foo WHERE there is a match for '%name%'

That's exactly what we can do now except that we have to name the exact field that will have the value, if the value is in a given record. We certainly know how to format output from unions, joins, and other combinations now. What would change? By the time output needs to be formatted, all uncertainty has been resolved.

Comment by Sergei Golubchik [ 2017-05-28 ]

Here's an example, where the semantics of your extention becomes complicated:

CREATE TABLE t1 (t11 TEXT, t12 TEXT, t13 TEXT, FULLTEXT(t11, t12), FULLTEXT(t13));
CREATE TABLE t2 (t21 TEXT, t22 TEXT, t23 TEXT, FULLTEXT(t21, t22), FULLTEXT(t23));
CREATE TABLE t3 (t31 TEXT, t32 TEXT, t33 TEXT, FULLTEXT(t31, t32), FULLTEXT(t33));
CREATE TABLE t4 (t41 TEXT, t42 TEXT, t43 TEXT, FULLTEXT(t41, t42), FULLTEXT(t43));
-- some inserts here ...
SELECT * FROM t1, t2, (SELECT * FROM t3
                       WHERE MATCH(t31, t32) AGAINST ('word31') > MATCH (t33) AGAINST ('word32')) AS t3d
         WHERE MATCH (t11, t12) AGAINST ('word11') AND MATCH (t21, t22) AGAINST ('word21') > 1
         HAVING MATCH (t13) AGAINST ('word12')
         ORDER BY MATCH (t23 AGAINST ('word22')
UNION SELECT * FROM t4 WHERE MATCH (t41, t42) AGAINST ('word41')

But anyway, here's another reason for “impossible”: when using prepared statement, one usually does

  • prepare ("select ...")
  • read result metadata — number of columns, column types, etc
  • bind output variables, if needed
  • execute the statement
  • fetch results, row by row

so there is no possible way to change the number of result columns during the execution, because it must be known after the prepare.

Comment by M MacDonald [ 2017-05-29 ]

Looking at your 4-table example, I can't really imagine why anyone would want to create such a thing in real life. Trying to envision a plausible context for it, my mind just stumbles to a halt. Did you have anything real in mind, or was it just meant to be Something Very Complicated?

Apropos prep'd statements: I've never yet had occasion to use one, but it seems to me that you've left off the final step :

  • execute the statement
  • fetch results, row by row
  • post-process each row to remove unwanted results

Regardless, it'd be nice to have the feature, but it doesn't look like that's going to happen. So I think I'll go bang my head on some other wall awhile (maybe the one for which I would have used this feature). Perhaps someone else will come along later and make a more persuasive argument than I've done.

Comment by Sergei Golubchik [ 2017-05-29 ]

About the complex example: I don't think it makes any practical sense. My point is — the feature designed for some specific use case, should work great in this use case, of course. But it should work somehow in all other, even most contrived, cases, it should not crash, should not deliver wrong results. It should be defined for any query anyone can come up with — it doesn't have to work good, we don't care how it'll work, but it has to work somehow.

Of course, one option would be to return an error if ON_HIT_RETURN_HIT is used in any other context besides

SELECT * FROM table MATCH .. AGAINST (...)

This is one perfectly valid way to define what happens in all complex cases. It's a very unexpected (for a user) and inconsistent (we don't have anything that works like that) way, but it's valid.

Generated at Thu Feb 08 08:01:27 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.