Details

    Description

      Add a VECTOR(N) data type: an array of N floating-point numbers stored in some unspecified internal format (IEEE 754 or something else).

      Questions:
      Should we allow any particular operators on these?

          Activity

            rdyas Robert Dyas added a comment -

            I'm new to this, but it seems to me that a column should be defined as something like VECTOR(3072, 32) (3072 dimensions of 32-bit floats), where the bit width is an optional parameter that defaults to 32.

            Vector values should be accepted on input (INSERT/UPDATE) in JSON syntax, and when you SELECT a vector the default output format should probably also be JSON, so that it renders in a human-readable form.

            Also, it might be helpful to be able to specify a column as VECTOR(dim, bit_depth, model_name), so the column definition has an optional user-defined model name (maybe varchar(60)) associated with it. The vector is so tightly tied to the model that produced it (say openai-text-embedding-3-large) that the model is conceptually part of the data type, i.e. you have to generate embeddings with the same model to match against the vectors stored in the db. Things are evolving so quickly and models are changing, so it would be nice to be able to track the model that applies to those vectors. It would be up to the user to keep that accurate, of course. I can see using different embedding models for different use cases and over time; having this as part of the data type would reduce confusion.

            Please forgive me if any of what I've said is stupid, as I'm on a learning curve with all this stuff.
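
To make the JSON input/output idea above concrete, here is a minimal sketch in Python of the round trip: a JSON array is parsed, packed as N 32-bit floats, and rendered back as JSON. The function names and the little-endian packed layout are assumptions made for illustration only; the ticket leaves the internal format unspecified, and this is not the server's implementation.

```python
import json
import struct


def vector_from_json(text: str, dim: int) -> bytes:
    """Parse a JSON array and pack it as `dim` little-endian 32-bit floats.

    Hypothetical helper: the real storage format is not specified in the ticket.
    """
    values = json.loads(text)
    if not isinstance(values, list) or len(values) != dim:
        raise ValueError(f"expected a JSON array of {dim} numbers")
    return struct.pack(f"<{dim}f", *values)


def vector_to_json(blob: bytes) -> str:
    """Unpack 32-bit floats and render them as a JSON array."""
    dim = len(blob) // 4  # 4 bytes per 32-bit float
    return json.dumps(list(struct.unpack(f"<{dim}f", blob)))


blob = vector_from_json("[0.5, -1.25, 3.0]", dim=3)
print(len(blob))             # 12 (3 values x 4 bytes)
print(vector_to_json(blob))  # [0.5, -1.25, 3.0]
```

The sample values are exactly representable in 32 bits, so they survive the round trip unchanged. A value such as 0.1 would come back with float32 rounding error, which is one thing a human-readable output format would have to account for.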


            serg Sergei Golubchik added a comment -

            Maybe. For now we plan something like VECTOR(3072), where the elements are always 32-bit floats. But we can indeed add the float width later.

            As for the "model name": the server doesn't call the model directly (yet) and has no way of verifying which model generated an embedding, so it cannot enforce the model. You can use COMMENT "generated by openai-text-embedding-3-large". It'll be just as good: purely informational for you, since the server cannot enforce it anyway.
            rdyas Robert Dyas added a comment -

            Comment on column works.


            piki Patrick Reynolds added a comment -

            > VECTOR(3072, 32)

            Bit width alone isn't enough to specify a representation format. For example, FP16 and bfloat16 are both 16 bits, and int8 and BF8 are both 8 bits. The second, optional parameter on the vector column should probably be a string (or a keyword) like VECTOR(3072, "float32").
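
The point about bit width can be checked with a short Python sketch: the same 16-bit pattern decodes to different numbers depending on whether it is read as FP16 or as bfloat16. The helper names are illustrative, and the bfloat16 decoder relies on the fact that bfloat16 is the upper 16 bits of an IEEE 754 32-bit float.

```python
import struct


def decode_fp16(bits: int) -> float:
    """Interpret 16 bits as IEEE 754 binary16 (1 sign, 5 exponent, 10 fraction bits)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def decode_bf16(bits: int) -> float:
    """Interpret 16 bits as bfloat16 by widening them to the top half of a binary32."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


bits = 0x3C00
print(decode_fp16(bits))  # 1.0
print(decode_bf16(bits))  # 0.0078125
```

Both values are 16 bits wide, so a declaration like VECTOR(3072, 16) could not tell them apart, whereas a format name could.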

            serg Sergei Golubchik added a comment -

            Right. Thanks!

            People

              serg Sergei Golubchik
