Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35450

VEC_DISTANCE() function to autouse the available index type

Details

    Description

      With multiple implementations of vector index types we're burdening the application and the database to be in sync over the type of index. A wrong step means there will be a table scan.

      Simplifying it for the user is a generic VEC_DISTANCE function that will operate on the index that's there.

      If two indexes are possible I guess we select the first

      Attachments

        Issue Links

          Activity

            branch bb-11.8-MDEV-35450-vec_distance

            serg Sergei Golubchik added a comment - branch bb-11.8- MDEV-35450 -vec_distance

            OK to push after dividing the test separating "binary" results.

            sanja Oleksandr Byelkin added a comment - OK to push after dividing the test separating "binary" results.

            In the current implementation (as of adac8f1b4), an attempt to use VEC_DISTANCE fails if a vector index isn't found. Wouldn't it be logical for it to work always, falling back to mhnsw_default_distance if there is no index? VEC_DISTANCE[_xxx] has semantics beyond just ORDER BY .. LIMIT, and can be used for comparing variables, literals, view columns etc., when the index does not exist or is not detectable. Also it would make it more meaningful (and flexible) as a virtual column / check constraint function than it is now, when VEC_DISTANCE is either rejected if there is no index, or is silently converted into VEC_DISTANCE_xxx if there is an index (and if the index is later changed, the functions will diverge, which may come as a surprise to the user).

            elenst Elena Stepanova added a comment - In the current implementation (as of adac8f1b4), an attempt to use VEC_DISTANCE fails if a vector index isn't found. Wouldn't it be logical for it to work always, falling back to mhnsw_default_distance if there is no index? VEC_DISTANCE[_xxx] has semantics beyond just ORDER BY .. LIMIT , and can be used for comparing variables, literals, view columns etc., when the index does not exist or is not detectable. Also it would make it more meaningful (and flexible) as a virtual column / check constraint function than it is now, when VEC_DISTANCE is either rejected if there is no index, or is silently converted into VEC_DISTANCE_xxx if there is an index (and if the index is later changed, the functions will diverge, which may come as a surprise to the user).

            using mhnsw_default_distance is arguable. it kind of makes sense, but considering MDEV-35724 I'd rather not implement a default fallback now. Better to have an error, which is reported as a bug and fixed, than to have the function silently working or silently returning incorrect results, depending on the session variable.

            serg Sergei Golubchik added a comment - using mhnsw_default_distance is arguable. it kind of makes sense, but considering MDEV-35724 I'd rather not implement a default fallback now. Better to have an error, which is reported as a bug and fixed, than to have the function silently working or silently returning incorrect results, depending on the session variable.

            I agree that using mhnsw_default_distance is arguable – that is, whichever decision we make, it is likely that some users would find it counter-intuitive and argue against it.

            Given that it is easier to relax the rules later than to make them stricter (new stricter rules cause failures in OM=>NS replication etc.), it is probably indeed better to keep it the way it is for now.

            Assuming that it is not going to change now, In my opinion, the feature can be pushed into the main branch and released with 11.8.1, as of 57e669f1 plus a fix for MDEV-35885. I hope it will be a test-only change, in which case re-testing after the fix is not necessary; but it makes the CI red, so it must be fixed before the push (it may escape branch protection, as it's a 32-bit failure).

            For possible documentation and/or user awareness,

            the behavior may seem inconsistent in some ways, both in relation to the lack of the fallback and otherwise, e.g.

            • VEC_DISTANCE is rejected if the index is not found, assuming a user mistake, but VEC_DISTANCE_xxx works if there is no index or the index is of a different distance type, which can be a mistake just as well;
            • upon creation of a view, or a default value, or a virtual column, or a check constraint using the function, VEC_DISTANCE is expanded and stored in the definition as VEC_DISTANCE_xxx; but we don't always store the explicit value of the distance attribute for the index itself, so for example here we end up with the table definition where the function type is hardcoded, but the index still uses the session distance:

              MariaDB [test]> set mhnsw_default_distance=default;
               
              MariaDB [test]> create or replace table t (a vector(1) not null, vector(a), d float as (vec_distance(a,0x30303030)));
               
              MariaDB [test]> show create table t \G
              *************************** 1. row ***************************
                     Table: t
              Create Table: CREATE TABLE `t` (
                `a` vector(1) NOT NULL,
                `d` float GENERATED ALWAYS AS (VEC_DISTANCE_EUCLIDEAN(`a`,0x30303030)) VIRTUAL,
                VECTOR KEY `a` (`a`)
              ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_uca1400_ai_ci
              

              So if later the table is re-created with the same definition with a different session value (or even if the index is modified later), the function used for the second column will become wrong. Of course, it would also happen if the function was defined explicitly to begin with, but that's exactly kind of flexibility one may intuitively expect from an abstraction like VEC_DISTANCE – that it will automatically use the right algorithm depending on circumstances. Note that stored routines, on the other hand, don't expand VEC_DISTANCE.

            • the use of VEC_DISTANCE always requires a vector key, even if the index is not actually used in the query.
            elenst Elena Stepanova added a comment - I agree that using mhnsw_default_distance is arguable – that is, whichever decision we make, it is likely that some users would find it counter-intuitive and argue against it. Given that it is easier to relax the rules later than to make them stricter (new stricter rules cause failures in OM=>NS replication etc.), it is probably indeed better to keep it the way it is for now. Assuming that it is not going to change now, In my opinion, the feature can be pushed into the main branch and released with 11.8.1, as of 57e669f1 plus a fix for MDEV-35885 . I hope it will be a test-only change, in which case re-testing after the fix is not necessary; but it makes the CI red, so it must be fixed before the push (it may escape branch protection, as it's a 32-bit failure). For possible documentation and/or user awareness, the behavior may seem inconsistent in some ways, both in relation to the lack of the fallback and otherwise, e.g. VEC_DISTANCE is rejected if the index is not found, assuming a user mistake, but VEC_DISTANCE_xxx works if there is no index or the index is of a different distance type, which can be a mistake just as well; upon creation of a view, or a default value, or a virtual column, or a check constraint using the function, VEC_DISTANCE is expanded and stored in the definition as VEC_DISTANCE_xxx ; but we don't always store the explicit value of the distance attribute for the index itself, so for example here we end up with the table definition where the function type is hardcoded, but the index still uses the session distance: MariaDB [test]> set mhnsw_default_distance= default ;   MariaDB [test]> create or replace table t (a vector(1) not null , vector(a), d float as (vec_distance(a,0x30303030)));   MariaDB [test]> show create table t \G *************************** 1. row *************************** Table : t Create Table : CREATE TABLE `t` ( `a` vector(1) NOT NULL , `d` float GENERATED ALWAYS AS (VEC_DISTANCE_EUCLIDEAN(`a`,0x30303030)) VIRTUAL, VECTOR KEY `a` (`a`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE =utf8mb4_uca1400_ai_ci So if later the table is re-created with the same definition with a different session value (or even if the index is modified later), the function used for the second column will become wrong. Of course, it would also happen if the function was defined explicitly to begin with, but that's exactly kind of flexibility one may intuitively expect from an abstraction like VEC_DISTANCE – that it will automatically use the right algorithm depending on circumstances. Note that stored routines, on the other hand, don't expand VEC_DISTANCE . the use of VEC_DISTANCE always requires a vector key, even if the index is not actually used in the query.

            Forgot to mention in the above comment: MTR coverage misses one line – technically not from the feature patch but from the cleanup/refactoring commits accompanying the feature, but nevertheless:

            ===File sql/item_vectorfunc.h:
                 65 : +  { return get_item_copy<Item_func_vec_distance>(thd, this); }
            

            If you want to fill the gap, the following test case should do it:

            create table t1 (a vector(1) not null, vector(a));
            create or replace algorithm=temptable view v1 as select * from t1;
            select * from v1 where vec_distance(a,0x30303030) > 0;
            drop view v1;
            drop table t1;
            

            Maybe it could be made simpler, but at least it worked for me.

            elenst Elena Stepanova added a comment - Forgot to mention in the above comment: MTR coverage misses one line – technically not from the feature patch but from the cleanup/refactoring commits accompanying the feature, but nevertheless: ===File sql/item_vectorfunc.h: 65 : + { return get_item_copy<Item_func_vec_distance>(thd, this); } If you want to fill the gap, the following test case should do it: create table t1 (a vector(1) not null , vector(a)); create or replace algorithm=temptable view v1 as select * from t1; select * from v1 where vec_distance(a,0x30303030) > 0; drop view v1; drop table t1; Maybe it could be made simpler, but at least it worked for me.

            People

              serg Sergei Golubchik
              danblack Daniel Black
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.