Robert Dyas
added a comment - I'm new to this, but it seems to me that a column should be defined as something like VECTOR(3072, 32) (3072 dimensions of 32-bit floats), where the 32-bit width is the default and is an optional parameter.
Vector values should be able to be input (INSERT/UPDATE) via JSON syntax, and when you SELECT a vector, the default output should probably be JSON too (for a human-readable rendering).
Also, it might be helpful if you could specify a column as VECTOR(dim, bit_depth, model_name), so the column definition has an optional user-defined model name (maybe varchar(60)) associated with it. The vector is so tightly tied to the model that produced it (say openai-text-embedding-3-large) that the model is conceptually part of the datatype, i.e. you have to generate embeddings with the same model to match against a vector stored in the db. Things are evolving quickly and models are changing, so it would be nice to be able to track the model that applies to those vectors. It's up to the user to ensure they match, of course. I can see using different embedding models for different use cases and over time, and having this as part of the datatype would reduce confusion.
Please forgive me if any of what I've said is stupid, as I'm on a learning curve with all this stuff.
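To make the JSON input/output idea above concrete, here is a minimal Python sketch of the round trip between a JSON array and a packed array of 32-bit floats. The function names are hypothetical, and the little-endian packed layout is an assumption for illustration only; nothing in this thread says how the server stores vectors.

```python
import json
import struct


def vector_from_json(text: str, dim: int) -> bytes:
    """Parse a JSON array of numbers and pack it as little-endian 32-bit floats.

    The packed layout is an assumption for illustration, not the server's format.
    """
    values = json.loads(text)
    if not isinstance(values, list) or len(values) != dim:
        raise ValueError(f"expected a JSON array of {dim} numbers")
    if not all(isinstance(v, (int, float)) for v in values):
        raise ValueError("every element must be a number")
    return struct.pack(f"<{dim}f", *values)


def vector_to_json(blob: bytes) -> str:
    """Unpack little-endian 32-bit floats and render them as a JSON array."""
    if len(blob) % 4 != 0:
        raise ValueError("blob length must be a multiple of 4 bytes")
    dim = len(blob) // 4
    return json.dumps(list(struct.unpack(f"<{dim}f", blob)))
```

Values that are not exactly representable in 32 bits (0.1, for example) will come back slightly different from what was typed, which is one thing to keep in mind when treating JSON as the default display format.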
Sergei Golubchik
added a comment - Maybe. For now we plan something like VECTOR(3072), where the values are always 32-bit floats. But we can indeed add the float width later.
As for the "model name": the server doesn't call the model directly (yet) and has no way of verifying which model generated an embedding, so it cannot enforce the model. You can use COMMENT "generated by openai-text-embedding-3-large" instead. It will be just as good: purely informational for you, since the server cannot enforce it anyway.
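Since the server cannot enforce the model, one option is to check it on the application side. The sketch below is a hypothetical illustration in Python: `TaggedVector` and `cosine_similarity` are made-up names, and the idea of carrying the model name next to each vector is an assumption, not something the server provides.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TaggedVector:
    """An embedding paired with the name of the model that produced it (hypothetical)."""
    model: str
    values: Tuple[float, ...]


def cosine_similarity(a: TaggedVector, b: TaggedVector) -> float:
    """Compare two embeddings, refusing to mix vectors from different models."""
    if a.model != b.model:
        raise ValueError(f"model mismatch: {a.model!r} vs {b.model!r}")
    if len(a.values) != len(b.values):
        raise ValueError("dimension mismatch")
    dot = sum(x * y for x, y in zip(a.values, b.values))
    norm_a = math.sqrt(sum(x * x for x in a.values))
    norm_b = math.sqrt(sum(y * y for y in b.values))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

The model name here is still only as trustworthy as whoever wrote it, exactly as with the COMMENT approach; the check just catches accidental mixing earlier.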
Patrick Reynolds
added a comment - > VECTOR(3072, 32)
Bit width alone isn't enough to specify a representation format. For example, FP16 and bfloat16 are both 16 bits, and int8 and BF8 are both 8 bits. The second, optional parameter on the vector column should probably be a string (or a keyword) like VECTOR(3072, "float32").
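The FP16 versus bfloat16 point can be checked directly. This small Python sketch (helper names are hypothetical) decodes the same 16-bit pattern under both formats, using only the standard `struct` module. It relies on the fact that bfloat16 is the upper 16 bits of an IEEE 754 32-bit float.

```python
import struct


def decode_fp16(bits: int) -> float:
    """Interpret a 16-bit pattern as IEEE 754 half precision (FP16)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def decode_bfloat16(bits: int) -> float:
    """Interpret a 16-bit pattern as bfloat16.

    bfloat16 is the top half of a 32-bit float, so shifting the pattern
    left by 16 bits and decoding as float32 gives the value exactly.
    """
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]


pattern = 0x3F80
print(decode_fp16(pattern))      # 1.875
print(decode_bfloat16(pattern))  # 1.0
```

The same two bytes read as 1.875 in one format and 1.0 in the other, so a column declared only as "16 bits" would be ambiguous.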