Details

    • Epic
    • Status: Stalled (View Workflow)
    • Critical
    • Resolution: Unresolved
    • N/A
    • Server
    • vector search

    Description

      A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
      This task is about implementing indexes that allow to do it fast.

      • ideally they'll be engine independent
      • indexes should be update-able
      • in this task we'll only do Euclidean distance
      • we'll benchmark it on real multi-million-rows data sets
      • what algorithm, exactly, to use is still unclear

      Attachments

        Issue Links

          Activity

            serg Sergei Golubchik created issue -
            serg Sergei Golubchik made changes -
            Field Original Value New Value
            serg Sergei Golubchik made changes -
            Description A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets

            * what algorithm, exactly, to use is still unclear
            A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear
            serg Sergei Golubchik made changes -
            Fix Version/s 11.5 [ 29506 ]
            serg Sergei Golubchik made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            parser, grammar. CREATE TABLE works,

            serg Sergei Golubchik added a comment - parser, grammar. CREATE TABLE works,

            metadata is stored in the frm, read from frm, type checks (only BLOB and NOT NULL)
            SHOW commands and INFORMATION_SCHEMA.STATISTICS

            serg Sergei Golubchik added a comment - metadata is stored in the frm, read from frm, type checks (only BLOB and NOT NULL ) SHOW commands and INFORMATION_SCHEMA.STATISTICS

            An API for a graph algorithm.

            Algorithm implementation needs to provide methods for

            • add a vector to the graph
            • remove a vector from a graph
            • update — only if there will be some algorithm that can do it much faster that delete+insert
            • build a graph from a set of vectors

            The server will provide methods for

            • read/update list of edges for a specific node
            serg Sergei Golubchik added a comment - An API for a graph algorithm. Algorithm implementation needs to provide methods for add a vector to the graph remove a vector from a graph update — only if there will be some algorithm that can do it much faster that delete+insert build a graph from a set of vectors The server will provide methods for read/update list of edges for a specific node
            serg Sergei Golubchik made changes -
            Comment [ (/) index storage: first version. inside the table row, in the internal (not visible to a user) column

            won't be performant, but allows to implement and debug the algorithm

            the engine doesn't see the key at all, so enable/disable keys likely won't work. but as we'll probably change the storage in the future, it doesn't matter now. ]
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Description A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear
            A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear

            h1. this task is at the moment in the _early design phase_ the ideas are recorded, added, changed, and -removed- in the linked gdoc
            serg Sergei Golubchik made changes -
            Description A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear

            h1. this task is at the moment in the _early design phase_ the ideas are recorded, added, changed, and -removed- in the linked gdoc
            A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear

            h1. *this task is at the moment in the _early design phase_ the ideas are recorded, added, changed, and -removed- in the linked gdoc*
            serg Sergei Golubchik made changes -
            Description A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear

            h1. *this task is at the moment in the _early design phase_ the ideas are recorded, added, changed, and -removed- in the linked gdoc*
            A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear

            h1. *this task is at the moment in the _early design phase_ the ideas are recorded, added, changed, and -removed- *in the linked gdoc*
            serg Sergei Golubchik made changes -
            Description A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear

            h1. *this task is at the moment in the _early design phase_ the ideas are recorded, added, changed, and -removed- *in the linked gdoc*
            A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear

            h1. *this task is at the moment in the _early design phase_ the ideas are recorded, added, changed, and -removed- in the linked gdoc*
            serg Sergei Golubchik made changes -
            Description A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear

            h1. *this task is at the moment in the _early design phase_ the ideas are recorded, added, changed, and -removed- in the linked gdoc*
            A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Description A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets
            * optimizer-wise we'll do like with fulltext search

            * what algorithm, exactly, to use is still unclear
            A common operation for multi-dimensional vectors is to find k nearest vectors to the given one.
            This task is about implementing indexes that allow to do it fast.

            * ideally they'll be engine independent
            * indexes should be update-able
            * in this task we'll only do Euclidean distance
            * we'll benchmark it on real multi-million-rows data sets

            * what algorithm, exactly, to use is still unclear
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.6 [ 29515 ]
            Fix Version/s 11.5 [ 29506 ]
            JIraAutomate JiraAutomate made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -

            From the product point of view, this is critical.

            julien.fritsch Julien Fritsch added a comment - From the product point of view, this is critical.
            julien.fritsch Julien Fritsch made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.7 [ 29815 ]
            Fix Version/s 11.6 [ 29515 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.8 [ 29921 ]
            Fix Version/s 11.7 [ 29815 ]
            sanja Oleksandr Byelkin added a comment - - edited

            Here is my review:
            This

            share->total_keys > share->keys
            to->s->keys == to->s->total_keys
            

            better cnvert to SHARE::has_hlindex() or something like that

            I am not sure about this changes (at least buffer size should be increaed to
            better reflect default length of vector test representation):

            diff --git a/sql/item_vectorfunc.cc b/sql/item_vectorfunc.cc
            index 8b9ea188490..315ad4e06fa 100644
            --- a/sql/item_vectorfunc.cc
            +++ b/sql/item_vectorfunc.cc
            @@ -134,6 +134,9 @@ String *Item_func_vec_fromtext::val_str(String *buf)
             {
               json_engine_t je;
               bool end_ok= false;
            +  char buff[STRING_BUFFER_USUAL_SIZE]; // maynbe *2 or *3 has more sens
            +                                       // (usual vector representation)
            +  String tmp_js(buff,sizeof(buff), &my_charset_bin);
               String *value = args[0]->val_json(&tmp_js);
               CHARSET_INFO *cs= value->charset();
             
            diff --git a/sql/item_vectorfunc.h b/sql/item_vectorfunc.h
            index 58dc300c451..1687eed7e29 100644
            --- a/sql/item_vectorfunc.h
            +++ b/sql/item_vectorfunc.h
            @@ -136,7 +136,6 @@ class Item_func_vec_totext: public Item_str_ascii_checksum_func
             
             class Item_func_vec_fromtext: public Item_str_func
             {
            -  String tmp_js;
             public:
               bool fix_length_and_dec(THD *thd) override;
               Item_func_vec_fromtext(THD *thd, Item *a);
            

            Should be tested with maribackup (high posibility of fail with the new ndex)

            Should be tested with mariadb-dump (probably pass, but will check that what
            was printed by SHOW CREATE TABLE willbe read back)

            It would be nice to have test with different types of indexses in one table
            and they used by optimiser (I also do not expect problems, just to have it
            checked).

            IMHO VECTOR(N) type should be done before release to avoid table converting
            on upgrade in the future.


            IMHO this cleanups a bit fixes and better be backported if possible:

            cleanup: remove unconditional #ifdef's

            cleanup: key algorithm vs key flags

            sanja Oleksandr Byelkin added a comment - - edited Here is my review: This share->total_keys > share->keys to->s->keys == to->s->total_keys better cnvert to SHARE::has_hlindex() or something like that I am not sure about this changes (at least buffer size should be increaed to better reflect default length of vector test representation): diff --git a/sql/item_vectorfunc.cc b/sql/item_vectorfunc.cc index 8b9ea188490..315ad4e06fa 100644 --- a/sql/item_vectorfunc.cc +++ b/sql/item_vectorfunc.cc @@ -134,6 +134,9 @@ String *Item_func_vec_fromtext::val_str(String *buf) { json_engine_t je; bool end_ok= false; + char buff[STRING_BUFFER_USUAL_SIZE]; // maynbe *2 or *3 has more sens + // (usual vector representation) + String tmp_js(buff,sizeof(buff), &my_charset_bin); String *value = args[0]->val_json(&tmp_js); CHARSET_INFO *cs= value->charset(); diff --git a/sql/item_vectorfunc.h b/sql/item_vectorfunc.h index 58dc300c451..1687eed7e29 100644 --- a/sql/item_vectorfunc.h +++ b/sql/item_vectorfunc.h @@ -136,7 +136,6 @@ class Item_func_vec_totext: public Item_str_ascii_checksum_func class Item_func_vec_fromtext: public Item_str_func { - String tmp_js; public: bool fix_length_and_dec(THD *thd) override; Item_func_vec_fromtext(THD *thd, Item *a); Should be tested with maribackup (high posibility of fail with the new ndex) Should be tested with mariadb-dump (probably pass, but will check that what was printed by SHOW CREATE TABLE willbe read back) It would be nice to have test with different types of indexses in one table and they used by optimiser (I also do not expect problems, just to have it checked). IMHO VECTOR(N) type should be done before release to avoid table converting on upgrade in the future. IMHO this cleanups a bit fixes and better be backported if possible: cleanup: remove unconditional #ifdef's cleanup: key algorithm vs key flags
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.8 [ 29921 ]
            serg Sergei Golubchik made changes -
            Issue Type New Feature [ 2 ] Epic [ 5 ]
            serg Sergei Golubchik made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Fix Version/s N/A [ 14700 ]
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Summary k-ANN indexes for vectors vector search
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            svoj Sergey Vojtovich made changes -
            julien.fritsch Julien Fritsch made changes -
            Epic Name vectore search
            julien.fritsch Julien Fritsch made changes -
            Epic Name vectore search vector search
            adamluciano Adam Luciano made changes -
            Epic Child MDEV-36568 [ 133790 ]
            adamluciano Adam Luciano made changes -
            Epic Child MDEV-36605 [ 133866 ]

            People

              serg Sergei Golubchik
              serg Sergei Golubchik
              Votes:
              2 Vote for this issue
              Watchers:
              19 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.