Details

    Description

      VECTOR(N) data type, that is, an array of N floating-point numbers in some unspecified internal format (IEEE 754 or something else)

      Questions:
      Should we allow any particular operators on these?


          Activity

            piki Patrick Reynolds added a comment -

            Two proposals related to this:

            First, we should standardize on a table-creation syntax. PlanetScale is using CREATE TABLE t(val VECTOR(4)), with the dimension being a required part of the type. We consider a vector a strong type, with a mandatory dimension, so any attempt to mix string data or vectors of different dimensions can be caught immediately.

            Second, the storage format should have a version so it is self-describing. This is what keeps people from accidentally inserting a binary blob for a vector of the wrong format. PlanetScale's chosen format is:

            • 16-bit little endian version
            • 16-bit little endian dimension
            • data

            For version==1, the "data" field is an array of 32-bit IEEE 754 floats. Other versions could support other number formats like bfloat16, compression and packing schemes, and/or dimensions larger than 65535. Every vector serialization format will start with a 16-bit version, but the other fields can vary by version.

            Why an explicit dimension, when we know that the dimension is generally (blob_size-4)/4? First, it acts as a redundancy check against truncated data. And second, it allows for formats that involve compression, sparse storage, or float formats shorter than 8 bits. It's also a productive use of the otherwise empty space if you want the "data" field to be 32-bit aligned.
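
            A minimal sketch of this layout in Python (the function names are illustrative, not part of any proposal):

                import struct

                def pack_vector_v1(values):
                    # version-1 layout: 16-bit LE version, 16-bit LE dimension,
                    # then dim 32-bit LE IEEE 754 floats
                    dim = len(values)
                    if dim > 65535:
                        raise ValueError("version 1 caps the dimension at 65535")
                    return struct.pack("<HH%df" % dim, 1, dim, *values)

                def unpack_vector_v1(blob):
                    version, dim = struct.unpack_from("<HH", blob, 0)
                    if version != 1:
                        raise ValueError("unsupported format version %d" % version)
                    # the explicit dimension doubles as a redundancy check
                    if len(blob) != 4 + 4 * dim:
                        raise ValueError("truncated or corrupt vector blob")
                    return list(struct.unpack_from("<%df" % dim, blob, 4))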

            serg Sergei Golubchik added a comment - edited

            Yes, I agree that VECTOR(N) should be a strong type, with a mandatory dimensionality and it should only store valid vectors with exactly N dimensions.

            The format is arguable; it could be part of the column metadata, like

            val VECTOR(1234) FORMAT=float32

            or

            val VECTOR(1234) '{"format":"float32"}'

            in your case. I'm not saying it would be better, only that it's an alternative.

            The number of dimensions is even more arguable. Redundancy, yes. As for compression/sparse/etc., it could be that format=1 means N 32-bit floats, while format=2 means a length plus compressed data, for example. That is, the length field can be conditional, present only for the formats that need it.

            But indeed, if you want to keep the "data" field 32-bit aligned, the 16-bit length is essentially free anyway; it doesn't take any extra space.
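
            A sketch of that conditional-length idea, assuming a purely hypothetical format=2 that stores an explicit byte length followed by zlib-compressed float32 data:

                import struct, zlib

                def decode(blob, column_dim):
                    fmt, = struct.unpack_from("<H", blob, 0)
                    if fmt == 1:
                        # format 1: no length field; column metadata fixes N
                        floats = struct.unpack_from("<%df" % column_dim, blob, 2)
                    elif fmt == 2:
                        # hypothetical format 2: 32-bit byte length,
                        # then zlib-compressed float32 data
                        nbytes, = struct.unpack_from("<I", blob, 2)
                        raw = zlib.decompress(blob[6:6 + nbytes])
                        floats = struct.unpack("<%df" % column_dim, raw)
                    else:
                        raise ValueError("unknown vector format %d" % fmt)
                    return list(floats)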


            piki Patrick Reynolds added a comment -

            Metadata that's part of the column type (like val VECTOR(1234) FORMAT=float32) doesn't travel with the value in cases like

            SET @pt = VEC_FromText(...)
            

            or

            INSERT INTO t2(vec_col) SELECT vec_col FROM t1
            

            or cases where the user inserts vector data serialized by the client app using prepared statements. In all those cases, having the vector blob data be self-describing lets us check if it's the correct format when we go to insert it in a table or use it for a distance calculation.
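
            A sketch of the kind of check a self-describing blob enables, assuming the version-1 layout proposed above (check_vector_blob is an illustrative name):

                import struct

                def check_vector_blob(blob, expected_dim):
                    # reject a blob whose header disagrees with the destination
                    # column, e.g. on INSERT or before a distance calculation
                    if len(blob) < 4:
                        raise ValueError("too short to hold a vector header")
                    version, dim = struct.unpack_from("<HH", blob, 0)
                    if version != 1:
                        raise ValueError("unsupported format version %d" % version)
                    if dim != expected_dim:
                        raise ValueError("vector has %d dimensions, column expects %d"
                                         % (dim, expected_dim))
                    if len(blob) != 4 + 4 * dim:
                        raise ValueError("blob size disagrees with declared dimension")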

            ahmedmadbouly ahmedmadbouly added a comment - edited

            I think the essential mathematical transformations would be helpful when working with vectors. It would be useful to allow shifting each element of the vector (applying + and - to every number inside the vector), and also scaling or shrinking it (multiplication and division).

            Also, if there is a column of type VECTOR(N) in a table, exact search inside the vectors would of course be useful. I mean, supporting queries that return all records where Vec[0] equals some number, or where Vec[1] is greater than some value (for example), where Vec is the name of the column of type VECTOR(N).
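
            For what it's worth, under the version-1 layout proposed above, such an element accessor could read straight out of the blob; a minimal sketch (vec_element is a hypothetical name, not a committed operator):

                import struct

                def vec_element(blob, k):
                    # hypothetical Vec[k] accessor over the version-1 layout
                    version, dim = struct.unpack_from("<HH", blob, 0)
                    assert version == 1
                    if not 0 <= k < dim:
                        raise IndexError("vector index out of range")
                    return struct.unpack_from("<f", blob, 4 + 4 * k)[0]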

            inaam.rana Inaam Rana added a comment -

            I am generally more inclined towards leaving the version and format specifiers at the column-metadata level. This is more in line with how DBs typically deal with any datatype. True, we are performing operations on the data assuming it is a float array, but isn't that similar to having an indexed blob column where the user mistakenly put garbage in the blob? If we allow the version to vary per row, we potentially open the door to a single column storing different versions, which is likely to lead to more complexity down the road.
            We can perform a length check during insertion. For example, in CloudSQL we create a constraint that the inserted length is 4 * dimensions.
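
            Under that column-metadata approach, per-value validation reduces to a plain length check; a one-line sketch of the CloudSQL-style constraint described above:

                def vector_length_ok(blob, dimensions):
                    # a float32 array of N dimensions must be exactly
                    # 4 * N bytes; no per-value header is required
                    return len(blob) == 4 * dimensions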

            rdyas Robert Dyas added a comment -

            I'm new to this, but it seems to me that a column should be defined as something like VECTOR(3072, 32) (3072 dimensions of 32-bit floats), where 32-bit is the default and the second parameter is optional.

            Vector values should be able to be input (insert/update) via JSON syntax, and when you SELECT a vector the default output syntax should probably be JSON (for a human-readable rendering).

            Also, it might be helpful to be able to specify a column as VECTOR(dim, bit_depth, model_name), so the column definition carries an optional user-defined model name (maybe a varchar(60)). The vector is so tightly tied to the model that produced it (say, openai-text-embedding-3-large) that the model is conceptually part of the datatype; i.e., you have to generate embeddings with the same model to match against the vectors stored in the db. Things are evolving so quickly and models keep changing, so it would be nice to be able to track which model applies to those vectors. It's up to the user to ensure they do, of course. I can see using different embedding models for different use cases and over time; having this as part of the datatype would reduce confusion.

            Please forgive me if any of what I've said is stupid, as I'm on a learning curve with all this stuff.


            serg Sergei Golubchik added a comment -

            Maybe. For now we plan something like VECTOR(3072), where it's always 32-bit floats. But we can add the float width later, indeed.

            As for the "model name" — the server doesn't call the model directly (yet) and it has no way of verifying what model has generated the embedding, so it cannot enforce the model. You can use COMMENT "generated by openai-text-embedding-3-large" — it'll be just as good, purely informational for you, the server cannot enforce it anyway.

            rdyas Robert Dyas added a comment -

            Comment on column works.


            piki Patrick Reynolds added a comment -

            > VECTOR(3072, 32)

            Bit width alone isn't enough to specify a representation format. For example, FP16 and bfloat16 are both 16 bits, and int8 and BF8 are both 8 bits. The second, optional parameter on the vector column should probably be a string (or a keyword) like VECTOR(3072, "float32").
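
            A quick Python illustration that the same 16-bit width admits two different encodings (assuming truncation-style float32-to-bfloat16 conversion):

                import struct

                def fp16_bytes(x):
                    # IEEE 754 binary16 ("FP16"), little endian
                    return struct.pack("<e", x)

                def bfloat16_bytes(x):
                    # bfloat16 keeps the top 16 bits of the float32 encoding;
                    # in little-endian order those are the last two bytes
                    return struct.pack("<f", x)[2:]

                print(fp16_bytes(1.5).hex())      # 003e
                print(bfloat16_bytes(1.5).hex())  # c03f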


            serg Sergei Golubchik added a comment -

            Right. Thanks!

