  MariaDB Server / MDEV-35246

Vector search skips a row in the table

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: N/A
    • Fix Version/s: 11.7.1
    • Component/s: Vector search
    • Labels: None

    Description

      The issue was observed in manual testing.

      At the start of these events the server wasn't completely fresh, but the table in question had been created, dropped and re-created several times without problems, and to my knowledge there were no visible old artifacts related to the table.

      An InnoDB table with a VECTOR(1) column and a vector key was created and populated with 100 rows, with random values in the vector column. The table also had system versioning at the table level and WITHOUT SYSTEM VERSIONING on the vector column; I'm not sure whether that is important.

      Then some operations were performed, on this table as well as on others; importantly, 80 out of 100 rows were deleted from the table, and for the rest the vector column was updated to a single constant value (see the sketch below).
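
      A minimal sketch of the setup as described (the table and column names, the exact constant value, and the use of the sequence engine are assumptions on my part; the actual statements are recorded in console.txt):

      create or replace table t (
        a int,
        v vector(1) not null without system versioning,
        vector(v)
      ) with system versioning;
      insert into t select seq, vec_fromtext(concat('[', rand(), ']')) from seq_1_to_100;
      -- the later operations, per the description:
      delete from t where a > 20;    -- 80 of 100 rows deleted
      update t set v = 0x31313131;   -- one constant value; these four bytes ("1111")
                                     -- decode as a little-endian float32 of about
                                     -- 2.57849e-9, matching vec_totext(v) below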

      After that, a table scan kept returning 20 rows as it should (or 100 rows with SYSTEM_TIME ALL), but vector search with LIMIT 30 started returning only 19 rows.

      Once it started happening, it became persistent – after restarting the server, or using the same datadir on another server, the result set was still wrong.

      I have the unabridged record of console operations and output (attached as console.txt). It is quite noisy and contains a lot of irrelevant information and user errors typical of manual operations, but as far as I can tell nothing unexpected until the wrong result set started appearing. The described events start at line 840. The wrong result occurs for the first time at line 1380.

      However, I couldn't reproduce the issue by repeating all the same operations in the same order.

      I don't have the RAND() seeds for the inserted data, but the data was printed in the output, so while trying to reproduce I converted it into INSERT .. VALUES instead of using RAND(). It didn't help.

      All in all, it seems that something random may have happened internally, maybe while updating the index. I am attaching the datadir in case it can be analyzed based on the existing data. It was created with commit 7fce19bd215ac0671855044520092aa4210049d1. Just unpack it and run the server on it; all default options are sufficient.

      Here is the erroneous result:

      MariaDB [test]> select a, vec_totext(v) from t  order by vec_distance_euclidean(v,0x31313131) limit 30;
      +------+---------------+
      | a    | vec_totext(v) |
      +------+---------------+
      |    1 | [2.57849e-9]  |
      |   20 | [2.57849e-9]  |
      |   11 | [2.57849e-9]  |
      |    6 | [2.57849e-9]  |
      |    7 | [2.57849e-9]  |
      |    3 | [2.57849e-9]  |
      |    5 | [2.57849e-9]  |
      |    8 | [2.57849e-9]  |
      |    9 | [2.57849e-9]  |
      |    2 | [2.57849e-9]  |
      |    4 | [2.57849e-9]  |
      |   12 | [2.57849e-9]  |
      |   19 | [2.57849e-9]  |
      |   18 | [2.57849e-9]  |
      |   17 | [2.57849e-9]  |
      |   15 | [2.57849e-9]  |
      |   14 | [2.57849e-9]  |
      |   13 | [2.57849e-9]  |
      |   10 | [2.57849e-9]  |
      +------+---------------+
      19 rows in set (0.002 sec)
      

      compared to

      MariaDB [test]> select a, vec_totext(v) from t  order by vec_distance_euclidean(v,0x31313131) ;
      +------+---------------+
      | a    | vec_totext(v) |
      +------+---------------+
      |    1 | [2.57849e-9]  |
      |   12 | [2.57849e-9]  |
      |   13 | [2.57849e-9]  |
      |   14 | [2.57849e-9]  |
      |   15 | [2.57849e-9]  |
      |   16 | [2.57849e-9]  |
      |   17 | [2.57849e-9]  |
      |   18 | [2.57849e-9]  |
      |   19 | [2.57849e-9]  |
      |   11 | [2.57849e-9]  |
      |   10 | [2.57849e-9]  |
      |    2 | [2.57849e-9]  |
      |    3 | [2.57849e-9]  |
      |    4 | [2.57849e-9]  |
      |    5 | [2.57849e-9]  |
      |    6 | [2.57849e-9]  |
      |    7 | [2.57849e-9]  |
      |    8 | [2.57849e-9]  |
      |    9 | [2.57849e-9]  |
      |   20 | [2.57849e-9]  |
      +------+---------------+
      20 rows in set (0.003 sec)
      

      If it can't be analyzed this way, please feel free to close.

      Attachments

        1. console.txt
          74 kB
        2. data.tar.gz
          3.22 MB


          Activity

            serg Sergei Golubchik added a comment:

            This is kind of expected. The search is approximate and it might skip some rows — this is why recall isn't 100%.

            The optimizer shouldn't use the index when retrieving the whole table; that's MDEV-8306.

            elenst Elena Stepanova added a comment:

            This looks really strange. It's one thing to skip rows when you still return the requested amount of data, just not necessarily the best matches – that's quite understandable, this is approximate search.
            But here the "approximate" part is not just the search, it is the limit itself. I don't really see how it can be explained to a user that the table has 20 rows, and the query with LIMIT 30 only returns 19 rows.
            serg Sergei Golubchik added a comment (edited):

            It's MDEV-8306 — the index shouldn't have been used. If a table has a thousand rows, then LIMIT 20 will return 20 rows. It might skip the 15th best and return the 21st best instead (because the search is approximate), but it'll return 20 rows.

            Here the table only has 20 rows, so if it skips one, you'll get fewer rows in the result set. The vector index is slower than filesort when you need all rows, so the optimizer shouldn't have used it unless the LIMIT is much smaller than the table size.
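
            A possible workaround sketch for such a query, under two assumptions: that the vector index got the default name v (from "vector(v)"), and that the vector-search optimization honors index hints; neither assumption is verified here.

            select a, vec_totext(v) from t ignore index (v)
            order by vec_distance_euclidean(v, 0x31313131) limit 30;
            -- ignoring the index should force a full scan plus filesort,
            -- which returns all matching rows, just more slowly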

            elenst Elena Stepanova added a comment:

            I'm not certain it is this simple.
            While I couldn't reproduce the exact issue described initially, I came up with a different case derived from it. I think it is related.

            --source include/have_sequence.inc
             
            create or replace table t (a int, v vector(1) not null, vector(v));
            insert into t select seq, vec_fromtext(concat('[',seq,']')) from seq_1_to_200;
            update t set v = vec_fromtext(concat('[33]')) where a <= 15;
            select a, vec_totext(v) from t order by vec_distance_euclidean(v,vec_fromtext('[33]')) limit 25;
             
            # Cleanup
            drop table t;
            

            On commit 7fce19bd215ac0671855044520092aa4210049d1:

            MariaDB [test]> select a, vec_totext(v) from t order by vec_distance_euclidean(v,vec_fromtext('[33]')) limit 25;
            +------+---------------+
            | a    | vec_totext(v) |
            +------+---------------+
            |    4 | [33]          |
            |   15 | [33]          |
            |   10 | [33]          |
            |   33 | [33]          |
            |    8 | [33]          |
            |    5 | [33]          |
            |    1 | [33]          |
            |    2 | [33]          |
            |    3 | [33]          |
            |    6 | [33]          |
            |    9 | [33]          |
            |   11 | [33]          |
            |   14 | [33]          |
            |   13 | [33]          |
            |   12 | [33]          |
            |    7 | [33]          |
            +------+---------------+
            16 rows in set (0.002 sec)
            

            So, here the table has 200 rows total and the limit is 25 (which should take MDEV-8306 off the table), but the query only returns 16 rows (and it returns 15 if I use [0] instead of [33]; I took the latter to see whether it would actually find anything that was there before the UPDATE). Does it still fit your previous explanation?

            To me it looks more like something special about having exact matches, maybe, but of course I'm just guessing here.


            serg Sergei Golubchik added a comment:

            Yes, absolutely, this is definitely expected; it's how the streaming mode works. It tends to skip vectors that have exactly the same distance to the target, which is your case here. And it is basically never the case in real-world data. The algorithm is tuned and optimized for real-world use cases and does not perform well on artificial ones.
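
            A hedged way to probe the equal-distance hypothesis, as a variation of the test case above (not from the original report): perturb the updated vectors so they remain closest to the target but no two distances coincide.

            update t set v = vec_fromtext(concat('[', 33 + a/1000, ']')) where a <= 15;
            select a, vec_totext(v) from t
            order by vec_distance_euclidean(v, vec_fromtext('[33]')) limit 25;
            -- if identical distances are what trips the streaming mode,
            -- this variant should return the full 25 rows
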
            elenst Elena Stepanova added a comment (edited):

            I don't understand. It didn't skip vectors which have exactly the same distance to the target; it returned only vectors which have exactly the same distance to the target (zero), and skipped everything else.

            Of course it can happen in real life that records have identical vectors – duplicate records, plagiarism, different sources quoting the same text, and so on. I think it's more unrealistic to expect that in a large amount of data all vectors will always be unique.

            serg Sergei Golubchik added a comment:

                Of course it can happen in real life that records have identical vectors – duplicate records, plagiarism, different sources quoting the same text, and so on. I think it's more unrealistic to expect that in a large amount of data all vectors will always be unique.

            Yes. And it doesn't always skip them. It can, but doesn't have to. You're saying, basically, that identical vectors are caused by identical documents. In that case, potentially omitting some of them from the output (which will then be used as context for RAG) is acceptable. Or maybe not; the benchmark is still running. So this behavior might still change.

            elenst Elena Stepanova added a comment:

                In that case, potentially omitting some of them from the output (which will then be used as context for RAG) is acceptable.

            I quite agree, skipping identical documents and returning the next alternatives instead makes a lot of sense (I'm not sure it is acceptable when it leads to returning fewer rows than requested, but let that be for now).
            But as I said before, in the example above it doesn't skip identical rows, it does the opposite – it returns only identical rows and skips everything else.

            The logical outcomes would be:
            a) straightforward: return 16 identical (closest) rows and then 9 other good-enough results;
            b) "skip vectors that have exactly the same distance to the target" logic: return 1 row out of the 16 identical ones, plus 24 other good-enough results;
            c) something in between (it's approximate after all): return N rows out of the 16 identical ones (skipping 16-N), plus 25-N other good-enough results;
            d) flawed "something in between" logic (affecting the size of the result set): find N identical rows and 25-N other good-enough results, then throw away N-1 duplicates, thus returning 25-N+1 rows, of which one is a representative of the "identical" group and 25-N are other good-enough results.

            Any of these could be explained, but I don't see how "return 16 identical rows and nothing else when 25 rows are requested" can be justified.

            People

              Assignee: serg Sergei Golubchik
              Reporter: elenst Elena Stepanova
              Votes: 0
              Watchers: 3

