  MariaDB Server / MDEV-35246

Vector search skips a row in the table

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: N/A
    • Fix Version/s: 11.7.1
    • Component/s: Vector search
    • Labels: None

    Description

      The issue was observed in manual testing.

      At the start of these events the server wasn't completely fresh, but the table in question had been created, dropped and re-created several times without problems, and to my knowledge there were no visible old artifacts related to the table.

      An InnoDB table with a VECTOR(1) column and a vector key was created and populated with 100 rows, with random values in the vector column. The table also had system versioning at the table level and WITHOUT SYSTEM VERSIONING on the vector column; I'm not sure whether that is important.

      Then some operations were performed, on this table as well as on others; importantly, 80 out of 100 rows were deleted from the table, and for the rest the vector column was updated to a single constant value (see the sketch below).
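
      A minimal sketch of the setup as described (the table and column names, the exact constant value, and the use of the sequence engine are assumptions on my part; the actual statements are recorded in console.txt):

      create or replace table t (
        a int,
        v vector(1) not null without system versioning,
        vector(v)
      ) with system versioning;
      insert into t select seq, vec_fromtext(concat('[', rand(), ']')) from seq_1_to_100;
      -- the later operations, per the description:
      delete from t where a > 20;    -- 80 of 100 rows deleted
      update t set v = 0x31313131;   -- one constant value; these four bytes ("1111")
                                     -- decode as a little-endian float32 of about
                                     -- 2.57849e-9, matching vec_totext(v) below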

      After that, a table scan kept returning 20 rows as it should (or 100 rows with SYSTEM_TIME ALL), but vector search with LIMIT 30 started returning only 19 rows.

      Once it started happening, it became persistent – after restarting the server, or using the same datadir on another server, the result set was still wrong.

      I have the unabridged record of console operations and output (attached as console.txt). It is quite noisy and contains a lot of irrelevant information and user errors typical of manual operations, but as far as I can tell nothing unexpected until the wrong result set started appearing. The described events start at line 840. The wrong result occurs for the first time at line 1380.

      However, I couldn't reproduce the issue by repeating all the same operations in the same order.

      I don't have the RAND() seeds for the inserted data, but the data was printed in the output, so while trying to reproduce I converted it into INSERT .. VALUES instead of using RAND(). It didn't help.

      All in all, it seems that something random may have happened internally, maybe while updating the index. I am attaching the datadir in case it can be analyzed based on the existing data. It was created with commit 7fce19bd215ac0671855044520092aa4210049d1. Just unpack it and run the server on it; all default options are sufficient.

      Here is the erroneous result:

      MariaDB [test]> select a, vec_totext(v) from t  order by vec_distance_euclidean(v,0x31313131) limit 30;
      +------+---------------+
      | a    | vec_totext(v) |
      +------+---------------+
      |    1 | [2.57849e-9]  |
      |   20 | [2.57849e-9]  |
      |   11 | [2.57849e-9]  |
      |    6 | [2.57849e-9]  |
      |    7 | [2.57849e-9]  |
      |    3 | [2.57849e-9]  |
      |    5 | [2.57849e-9]  |
      |    8 | [2.57849e-9]  |
      |    9 | [2.57849e-9]  |
      |    2 | [2.57849e-9]  |
      |    4 | [2.57849e-9]  |
      |   12 | [2.57849e-9]  |
      |   19 | [2.57849e-9]  |
      |   18 | [2.57849e-9]  |
      |   17 | [2.57849e-9]  |
      |   15 | [2.57849e-9]  |
      |   14 | [2.57849e-9]  |
      |   13 | [2.57849e-9]  |
      |   10 | [2.57849e-9]  |
      +------+---------------+
      19 rows in set (0.002 sec)
      

      compared to

      MariaDB [test]> select a, vec_totext(v) from t  order by vec_distance_euclidean(v,0x31313131) ;
      +------+---------------+
      | a    | vec_totext(v) |
      +------+---------------+
      |    1 | [2.57849e-9]  |
      |   12 | [2.57849e-9]  |
      |   13 | [2.57849e-9]  |
      |   14 | [2.57849e-9]  |
      |   15 | [2.57849e-9]  |
      |   16 | [2.57849e-9]  |
      |   17 | [2.57849e-9]  |
      |   18 | [2.57849e-9]  |
      |   19 | [2.57849e-9]  |
      |   11 | [2.57849e-9]  |
      |   10 | [2.57849e-9]  |
      |    2 | [2.57849e-9]  |
      |    3 | [2.57849e-9]  |
      |    4 | [2.57849e-9]  |
      |    5 | [2.57849e-9]  |
      |    6 | [2.57849e-9]  |
      |    7 | [2.57849e-9]  |
      |    8 | [2.57849e-9]  |
      |    9 | [2.57849e-9]  |
      |   20 | [2.57849e-9]  |
      +------+---------------+
      20 rows in set (0.003 sec)
      

      If it can't be analyzed this way, please feel free to close.

      Attachments

        1. console.txt
          74 kB
        2. data.tar.gz
          3.22 MB


          Activity

            serg Sergei Golubchik added a comment:

            This is kind of expected. The search is approximate and it might skip some rows — this is why recall isn't 100%.

            The optimizer shouldn't use the index when retrieving the whole table; that's MDEV-8306.

            elenst Elena Stepanova added a comment:

            This looks really strange. It's one thing to skip rows when you still return the requested amount of data, just not necessarily the best matches – that's quite understandable, this is approximate search.
            But here the "approximate" part is not just the search, it is the limit itself. I don't really see how it can be explained to a user that the table has 20 rows, and the query with LIMIT 30 only returns 19 rows.
            serg Sergei Golubchik added a comment (edited):

            It's MDEV-8306 — the index shouldn't have been used. If a table has a thousand rows, then LIMIT 20 will return 20 rows. It might skip the 15th best and return the 21st best instead (because the search is approximate), but it'll return 20 rows.

            Here the table only has 20 rows, so if it skips one, you'll get fewer rows in the result set. The vector index is slower than filesort when you need all rows, so the optimizer shouldn't have used it unless the LIMIT is much smaller than the table size.
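
            A possible workaround sketch for such a query, under two assumptions: that the vector index got the default name v (from "vector(v)"), and that the vector-search optimization honors index hints; neither assumption is verified here.

            select a, vec_totext(v) from t ignore index (v)
            order by vec_distance_euclidean(v, 0x31313131) limit 30;
            -- ignoring the index should force a full scan plus filesort,
            -- which returns all matching rows, just more slowly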

            elenst Elena Stepanova added a comment:

            I'm not certain it is this simple.
            While I couldn't reproduce the exact issue described initially, I came up with a different case derived from it. I think it is related.

            --source include/have_sequence.inc
             
            create or replace table t (a int, v vector(1) not null, vector(v));
            insert into t select seq, vec_fromtext(concat('[',seq,']')) from seq_1_to_200;
            update t set v = vec_fromtext(concat('[33]')) where a <= 15;
            select a, vec_totext(v) from t order by vec_distance_euclidean(v,vec_fromtext('[33]')) limit 25;
             
            # Cleanup
            drop table t;
            

            On commit 7fce19bd215ac0671855044520092aa4210049d1:

            MariaDB [test]> select a, vec_totext(v) from t order by vec_distance_euclidean(v,vec_fromtext('[33]')) limit 25;
            +------+---------------+
            | a    | vec_totext(v) |
            +------+---------------+
            |    4 | [33]          |
            |   15 | [33]          |
            |   10 | [33]          |
            |   33 | [33]          |
            |    8 | [33]          |
            |    5 | [33]          |
            |    1 | [33]          |
            |    2 | [33]          |
            |    3 | [33]          |
            |    6 | [33]          |
            |    9 | [33]          |
            |   11 | [33]          |
            |   14 | [33]          |
            |   13 | [33]          |
            |   12 | [33]          |
            |    7 | [33]          |
            +------+---------------+
            16 rows in set (0.002 sec)
            

            So, here the table has 200 rows total and the limit is 25 (which should take MDEV-8306 off the table), but the query only returns 16 rows (and it returns 15 if I use [0] instead of [33]; I took the latter to see whether it would actually find anything that was there before the UPDATE). Does it still fit your previous explanation?

            To me it looks more like something special about having exact matches, maybe, but of course I'm just guessing here.


            serg Sergei Golubchik added a comment:

            Yes, absolutely, this is definitely expected; it's how the streaming mode works. It tends to skip vectors that have exactly the same distance to the target, which is your case here. And it is basically never the case in real-world data. The algorithm is tuned and optimized for real-world use cases and does not perform well on artificial ones.
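
            A hedged way to probe the equal-distance hypothesis, as a variation of the test case above (not from the original report): perturb the updated vectors so they remain closest to the target but no two distances coincide.

            update t set v = vec_fromtext(concat('[', 33 + a/1000, ']')) where a <= 15;
            select a, vec_totext(v) from t
            order by vec_distance_euclidean(v, vec_fromtext('[33]')) limit 25;
            -- if identical distances are what trips the streaming mode,
            -- this variant should return the full 25 rows
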
            elenst Elena Stepanova added a comment (edited):

            I don't understand. It didn't skip vectors which have exactly the same distance to the target; it returned only vectors which have exactly the same distance to the target (zero), and skipped everything else.

            Of course it can happen in real life that records have identical vectors – duplicate records, plagiarism, different sources quoting the same text, and so on. I think it's more unrealistic to expect that in a large amount of data all vectors will always be unique.

            serg Sergei Golubchik added a comment:

                Of course it can happen in real life that records have identical vectors – duplicate records, plagiarism, different sources quoting the same text, and so on. I think it's more unrealistic to expect that in a large amount of data all vectors will always be unique.

            Yes. And it doesn't always skip them. It can, but doesn't have to. You're saying, basically, that identical vectors are caused by identical documents. In that case, potentially omitting some of them from the output (which will then be used as context for RAG) is acceptable. Or maybe not; the benchmark is still running. So this behavior might still change.

            elenst Elena Stepanova added a comment:

                In that case, potentially omitting some of them from the output (which will then be used as context for RAG) is acceptable.

            I quite agree, skipping identical documents and returning the next alternatives instead makes a lot of sense (I'm not sure it is acceptable when it leads to returning fewer rows than requested, but let that be for now).
            But as I said before, in the example above it doesn't skip identical rows, it does the opposite – it returns only identical rows and skips everything else.

            The logical outcomes would be:
            a) straightforward: return 16 identical (closest) rows and then 9 other good-enough results;
            b) "skip vectors that have exactly the same distance to the target" logic: return 1 row out of the 16 identical ones, plus 24 other good-enough results;
            c) something in between (it's approximate after all): return N rows out of the 16 identical ones (skipping 16-N), plus 25-N other good-enough results;
            d) flawed "something in between" logic (affecting the size of the result set): find N identical rows and 25-N other good-enough results, then throw away N-1 duplicates, thus returning 25-N+1 rows, of which one is a representative of the "identical" group and 25-N are other good-enough results.

            Any of these could be explained, but I don't see how "return 16 identical rows and nothing else when 25 rows are requested" can be justified.

            People

              Assignee: serg Sergei Golubchik
              Reporter: elenst Elena Stepanova
              Votes: 0
              Watchers: 3

