Details

    Description

      An umbrella task for all vector search features that are planned to make it into 11.7

          Activity

            elenst Elena Stepanova added a comment - Branch bb-11.6-MDEV-32887-vector
            elenst Elena Stepanova added a comment - - edited

            In my opinion, the feature in its current shape can be pushed into the main branch and released with 11.7.1.

            In short, it appears stable enough for the RC after all the bugfixing, and we need the community to start experimenting with it on realistic datasets and use cases for possible further tuning before GA. Internal feature-focused testing will also continue on the main/11.7 branch before and after the 11.7.1 release.


            Long version:

            The main shortcoming of internal feature testing in this case was (and still is) that there are no usable criteria or requirements for "sufficient result correctness".

            Normally correctness is a fixed characteristic which does not cause much controversy and to a large extent can be tested on a variety of datasets, not necessarily real-life ones, while performance remains relative and is measured either on standard benchmarks (with the common understanding that they don't necessarily represent realistic use cases) or, in some cases, on actual real-life scenarios.

            In the case of vector search, where results are approximate by nature, we have two flexible characteristics which depend on each other (better correctness leads to worse performance and vice versa), and for neither of them can we set a hard limit of the kind "it cannot get worse than this under any circumstances" on any given dataset.
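
            To make "correctness" of an approximate search concrete, one common measure is recall at k: the fraction of the true k nearest neighbours that the approximate result contains. The sketch below is only an illustration and not part of the testing described here; the dataset is random, and the "approximate" result is a hypothetical stand-in built by hand rather than the output of a real index.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(dataset, query, k):
    """Brute-force ground truth: ids of the k rows nearest to the query."""
    ranked = sorted(dataset, key=lambda rid: euclidean(dataset[rid], query))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbours present in the approximate result."""
    if not exact_ids:
        return 1.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Hypothetical data: 200 random 8-dimensional vectors keyed by row id.
rng = random.Random(0)
dataset = {rid: [rng.random() for _ in range(8)] for rid in range(200)}
query = [rng.random() for _ in range(8)]

exact = exact_top_k(dataset, query, 10)

# Stand-in for an index lookup: keep 8 true neighbours and add 2 wrong rows.
wrong_rows = [rid for rid in dataset if rid not in exact][:2]
approx = exact[:8] + wrong_rows

print(recall_at_k(approx, exact))  # 0.8
```

            Measuring this value against query time for different index settings is one way to describe the trade-off, but what recall counts as "sufficient" still depends on the use case.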

            Whatever we know now about the comparative performance/recall of the current implementation was already presented in public talks and blog posts by feature developers. This stage of internal testing was mainly focused on stability and other less controversial aspects of the feature. I cannot claim such testing to be sufficient and I don't believe it will ever be, which is why I think it is important to get the feature out to the public and gather as much information as possible about what users consider more important in which cases, how much precision can be sacrificed for the sake of performance, and so on. I expect there will always be a fair amount of dissatisfaction as different use cases have different requirements, but hopefully we will get a bigger picture than we have now.

            Meanwhile, below are some notes from the testing, mostly for documentation and other "user must be aware" purposes.

            I won't list limitations or issues which are immediately obvious, only some which can go unnoticed but cause trouble. The list is dynamic, so some notes can become outdated quickly. In no particular order:

            • vector key is not used for ORDER BY .. DESC – the query will work, but full scan will be performed (MDEV-35296: fixed by disabling);
            • vector key is not used in UPDATE and DELETE – the query will work, but full scan will be performed (MDEV-35161);
            • vector key is not used in FROM subqueries / views;
            • vector key is used only when ORDER BY VEC_Distance_<xxx>(<col>,<constant>) LIMIT <x> – that is, not any expression involving it, nor a wrapping function, etc.;
            • InnoDB bulk insert does not work for tables with vector key – data loading may be slower than expected (MDEV-35287, MDEV-35130: fixed by disabling);
            • IMPORT TABLESPACE does not work, although allowed – can cause unexpected errors later (MDEV-35069);
            • DATA/INDEX DIRECTORY options are ignored for vector index – files can end up in a different location than the user was planning (MDEV-35152);
            • IGNORED attribute has no effect on vector keys – using it in experiments can lead to wrong conclusions (MDEV-35186);
            • non-default distance and M are not replicated – the replica can end up with a different index structure than assumed, and the search won't use the index (MDEV-35320);
            • optimizer doesn't / cannot take into account M and ef_search – vector search can turn out to be far from optimal compared to other possible plans;
            • VEC_ToText can return a text representation of invalid vectors – it can be confusing that the JSON looks okay, but the value cannot be inserted (MDEV-35210, fixed with the note "VEC_ToText still prints everything");
            • views involving vector functions may lead to non-working dumps produced by mariadb-dump (MDEV-35286, the bug was filed for GIS, vector has the same problem);
            • distance functions on vectors of different length return NULL without warnings (MDEV-35192, an error which is easy to make in an application and which can cause unexpected results as the search will become fully random);
            • XA involving vectors may cause replication errors (MDEV-35271, MDEV-35196);
            • myisampack should be avoided for now (MDEV-35198);
            • DROP TABLE on a table with vector key should be performed carefully, either preventing possible failures or cleaning up afterwards, as it is not atomic and can lead to corruption-like errors (MDEV-35241);
            • for ALTER TABLE on tables with vector key, it is better to specify ALGORITHM=COPY explicitly, as a non-copying algorithm may otherwise be chosen by default and cause issues (MDEV-35292, MDEV-35338);
            • for tables with vector keys, engines other than InnoDB, Aria, and MyISAM are best avoided for now, even if they seem to accept table creation (Spider and Mroonga are known to have issues, most other engines will be rejected right away);
            • when vectors are involved, mariadb-dump should be run with the --hex-blob option, otherwise data can be lost (MDEV-35221);
            • while experimenting with mhnsw_ef_search at runtime, make sure that the query cache is disabled, otherwise changing the value will not have the expected effect (MDEV-35185).
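
            Regarding the note above about distance functions returning NULL without warnings for vectors of different length (MDEV-35192): an application can protect itself by checking dimensions before it ever sends a query. The following is a minimal client-side sketch; the function name is hypothetical, and it computes the distance locally purely to show where the check belongs.

```python
import math


def checked_euclidean(a, b):
    """Euclidean distance that fails loudly on a length mismatch.

    Hypothetical application-side guard: it raises instead of silently
    producing a missing value when the vector lengths differ.
    """
    if len(a) != len(b):
        raise ValueError(f"vector length mismatch: {len(a)} vs {len(b)}")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(checked_euclidean([0.0, 3.0], [4.0, 0.0]))  # 5.0

try:
    checked_euclidean([1.0, 2.0, 3.0], [1.0, 2.0])
except ValueError as exc:
    print(exc)  # vector length mismatch: 3 vs 2
```

            The same length check can be applied to a query vector against the column's declared dimension before the statement is built.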

            People

              elenst Elena Stepanova
              serg Sergei Golubchik
              Votes: 2
              Watchers: 6
