Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-39060

a tool to configure vector search

    XMLWordPrintable

Details

    • New Feature
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • None
    • None

    Description

      Vector search has a set of configurable parameters. Some of them affect the speed of the search and the recall. Others affect the speed of inserting and the recall. Some affect the memory and storage consumption, others don't. Usually the longer the insert (or search) takes the higher the recall is, so there's always a tradeoff. Also, one can make the search slower and the inserting faster with the same recall. Or the vice versa, the inserting slower, but the search faster. It's another tradeoff. The optimal parameter values depend on the actual data and are fairly impossible to guess. We need a way to help users to tune their vector search parameters.

      Luckily, it's rather easy to do, conceptually. Even if very slow. The user need to load the data — the larger the sample is, the better. And collect actual queries, that is query vectors, again the more the better. After that the tool will perform the search using provided queries in the provided data without using an index — to obtain exact set of nearest neighbors — and then it could tune the index for the optimal recall and speed, as directed by the user.

      It should do about the following:

      • get vectors from the user application. The user specifies the table, the vector column contains the data.
      • likely it would be more convenient to copy them to a separate table to not touch the application. So, create a table, say

        create table data (id bigint unsigned auto_increment primary key, v vector(...));
        

      • create a query table

        create table query like data;
        

      • The user specifies what percentage of data to use, by default it's 100%. Copy rows from the user table to the data table using the Bernoulli sampling. Copy user-provided query vectors to the query table. If the user did not provide query vectors, get them from the data table.
      • the user provides a distance and a set of M and ef_search values, by default, say, it's (4, 5, 6, 8, 12, 16, 24, 32, 48) and (10, 20, 30, 40). The smallest ef_search is K.
      • generate ground truth searching for every query vector in the data table (using "cosine" as an example):

        select id from data order by vec_distance_cosine(v, @q) limit K
        

      • for every value of M
        • build an index, time it
        • for every value of ef_search
          • search every query vector, time it, compute recall
      • provide results in form of a report table

        create table report (
          M int not null,
          ef_search int not null,
          recall double not null,
          built_time time(6) not null,
          qps double not null
        );
        

      • or in some other form, e.g. html (see attached report.html)

      Questions:

      • should it be done purely in SQL? Or in some scripting language like Perl or Python?
        • SQL would make it the most portable and easy to use. Windows users won't need to install Perl/Python, cloud users won't need a shell access or direct DB access.
        • external tool though can have more features and could be relatively easy extended to support other databases.

      Attachments

        Issue Links

          Activity

            People

              serg Sergei Golubchik
              serg Sergei Golubchik
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.