Details

    Description

      VECTOR(N) data type, that is, an array of N floating-point numbers in some unspecified internal format (IEEE 754 or something else)

      Questions:
      Should we allow any particular operators on these?


          Activity

            piki Patrick Reynolds added a comment -

            Two proposals related to this:

            First, we should standardize on a table-creation syntax. PlanetScale is using CREATE TABLE t(val VECTOR(4)), with the dimension being a required part of the type. We consider a vector a strong type, with a mandatory dimension, so any attempt to mix string data or vectors of different dimensions can be caught immediately.

            Second, the storage format should have a version so it is self-describing. This is what keeps people from accidentally inserting a binary blob for a vector of the wrong format. PlanetScale's chosen format is:

            • 16-bit little endian version
            • 16-bit little endian dimension
            • data

            For version==1, the "data" field is an array of 32-bit IEEE 754 floats. Other versions could support other number formats like bfloat16, compression and packing schemes, and/or dimensions larger than 65535. Every vector serialization format will start with a 16-bit version, but the other fields can vary by version.

            Why an explicit dimension, when we know that the dimension is generally (blob_size-4)/4? First, it acts as a redundancy check against truncated data. And second, it allows for formats that involve compression, sparse storage, or float formats shorter than 8 bits. It's also a productive use of the otherwise empty space if you want the "data" field to be 32-bit aligned.
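
            A minimal sketch of this layout in Python (the function names are illustrative, not part of any proposal):

                import struct

                def pack_vector_v1(values):
                    # version-1 layout: 16-bit LE version, 16-bit LE dimension,
                    # then dim 32-bit LE IEEE 754 floats
                    dim = len(values)
                    if dim > 65535:
                        raise ValueError("version 1 caps the dimension at 65535")
                    return struct.pack("<HH%df" % dim, 1, dim, *values)

                def unpack_vector_v1(blob):
                    version, dim = struct.unpack_from("<HH", blob, 0)
                    if version != 1:
                        raise ValueError("unsupported format version %d" % version)
                    # the explicit dimension doubles as a redundancy check
                    if len(blob) != 4 + 4 * dim:
                        raise ValueError("truncated or corrupt vector blob")
                    return list(struct.unpack_from("<%df" % dim, blob, 4))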

            serg Sergei Golubchik added a comment - edited

            Yes, I agree that VECTOR(N) should be a strong type, with a mandatory dimensionality and it should only store valid vectors with exactly N dimensions.

            The format is arguable; it could be part of the column metadata, like

            val VECTOR(1234) FORMAT=float32

            or

            val VECTOR(1234) '{"format":"float32"}'

            in your case. I'm not saying it would be better, only that it's an alternative.

            The number of dimensions is even more arguable. Redundancy, yes. As for compression/sparse/etc., it could be that format=1 means N 32-bit floats, while format=2 means a length plus compressed data, for example. That is, the length field can be conditional, present only for the formats that need it.

            But indeed, if you want to keep the "data" field 32-bit aligned, the 16-bit length is essentially free anyway; it doesn't take any extra space.
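
            A sketch of that conditional-length idea, assuming a purely hypothetical format=2 that stores an explicit byte length followed by zlib-compressed float32 data:

                import struct, zlib

                def decode(blob, column_dim):
                    fmt, = struct.unpack_from("<H", blob, 0)
                    if fmt == 1:
                        # format 1: no length field; column metadata fixes N
                        floats = struct.unpack_from("<%df" % column_dim, blob, 2)
                    elif fmt == 2:
                        # hypothetical format 2: 32-bit byte length,
                        # then zlib-compressed float32 data
                        nbytes, = struct.unpack_from("<I", blob, 2)
                        raw = zlib.decompress(blob[6:6 + nbytes])
                        floats = struct.unpack("<%df" % column_dim, raw)
                    else:
                        raise ValueError("unknown vector format %d" % fmt)
                    return list(floats)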


            piki Patrick Reynolds added a comment -

            Metadata that's part of the column type (like val VECTOR(1234) FORMAT=float32) doesn't travel with the value in cases like

            SET @pt = VEC_FromText(...)
            

            or

            INSERT INTO t2(vec_col) SELECT vec_col FROM t1
            

            or cases where the user inserts vector data serialized by the client app using prepared statements. In all those cases, having the vector blob data be self-describing lets us check if it's the correct format when we go to insert it in a table or use it for a distance calculation.
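
            A sketch of the kind of check a self-describing blob enables, assuming the version-1 layout proposed above (check_vector_blob is an illustrative name):

                import struct

                def check_vector_blob(blob, expected_dim):
                    # reject a blob whose header disagrees with the destination
                    # column, e.g. on INSERT or before a distance calculation
                    if len(blob) < 4:
                        raise ValueError("too short to hold a vector header")
                    version, dim = struct.unpack_from("<HH", blob, 0)
                    if version != 1:
                        raise ValueError("unsupported format version %d" % version)
                    if dim != expected_dim:
                        raise ValueError("vector has %d dimensions, column expects %d"
                                         % (dim, expected_dim))
                    if len(blob) != 4 + 4 * dim:
                        raise ValueError("blob size disagrees with declared dimension")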

            ahmedmadbouly ahmedmadbouly added a comment - edited

            I think the essential mathematical transformations would be helpful when working with vectors. It would be useful to allow shifting each element of the vector (applying + and - to every number inside the vector), and also scaling or shrinking it (multiplication and division).

            Also, if there is a column of type VECTOR(N) in a table, exact search inside the vectors would of course be useful. I mean, supporting queries that return all records where Vec[0] equals some number, or where Vec[1] is greater than some value (for example), where Vec is the name of the column of type VECTOR(N).
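
            For what it's worth, under the version-1 layout proposed above, such an element accessor could read straight out of the blob; a minimal sketch (vec_element is a hypothetical name, not a committed operator):

                import struct

                def vec_element(blob, k):
                    # hypothetical Vec[k] accessor over the version-1 layout
                    version, dim = struct.unpack_from("<HH", blob, 0)
                    assert version == 1
                    if not 0 <= k < dim:
                        raise IndexError("vector index out of range")
                    return struct.unpack_from("<f", blob, 4 + 4 * k)[0]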

            inaam.rana Inaam Rana added a comment -

            I am generally more inclined towards leaving the version and format specifiers at the column-metadata level. This is more in line with how DBs typically deal with any datatype. True, we are performing operations on the data assuming it is a float array, but isn't that similar to having an indexed blob column where the user mistakenly put garbage in the blob? If we allow the version to vary per row, we potentially open the door to a single column storing different versions, which is likely to lead to more complexity down the road.
            We can perform a length check during insertion. For example, in CloudSQL we create a constraint that the inserted length is 4 * dimensions.
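
            Under that column-metadata approach, per-value validation reduces to a plain length check; a one-line sketch of the CloudSQL-style constraint described above:

                def vector_length_ok(blob, dimensions):
                    # a float32 array of N dimensions must be exactly
                    # 4 * N bytes; no per-value header is required
                    return len(blob) == 4 * dimensions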

            rdyas Robert Dyas added a comment -

            I'm new to this, but it seems to me that a column should be defined as something like VECTOR(3072, 32) (3072 dimensions of 32-bit floats), where 32-bit is the default and the second parameter is optional.

            Vector values should be able to be input (insert/update) via JSON syntax, and when you SELECT a vector the default output syntax should probably be JSON (for a human-readable rendering).

            Also, it might be helpful to be able to specify a column as VECTOR(dim, bit_depth, model_name), so the column definition carries an optional user-defined model name (maybe a varchar(60)). The vector is so tightly tied to the model that produced it (say, openai-text-embedding-3-large) that the model is conceptually part of the datatype; i.e., you have to generate embeddings with the same model to match against the vectors stored in the db. Things are evolving so quickly and models keep changing, so it would be nice to be able to track which model applies to those vectors. It's up to the user to ensure they do, of course. I can see using different embedding models for different use cases and over time; having this as part of the datatype would reduce confusion.

            Please forgive me if any of what I've said is stupid, as I'm on a learning curve with all this stuff.


            serg Sergei Golubchik added a comment -

            Maybe. For now we plan something like VECTOR(3072), where it's always 32-bit floats. But we can add the float width later, indeed.

            As for the "model name" — the server doesn't call the model directly (yet) and it has no way of verifying what model has generated the embedding, so it cannot enforce the model. You can use COMMENT "generated by openai-text-embedding-3-large" — it'll be just as good, purely informational for you, the server cannot enforce it anyway.

            rdyas Robert Dyas added a comment -

            Comment on column works.


            piki Patrick Reynolds added a comment -

            > VECTOR(3072, 32)

            Bit width alone isn't enough to specify a representation format. For example, FP16 and bfloat16 are both 16 bits, and int8 and BF8 are both 8 bits. The second, optional parameter on the vector column should probably be a string (or a keyword) like VECTOR(3072, "float32").
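
            A quick Python illustration that the same 16-bit width admits two different encodings (assuming truncation-style float32-to-bfloat16 conversion):

                import struct

                def fp16_bytes(x):
                    # IEEE 754 binary16 ("FP16"), little endian
                    return struct.pack("<e", x)

                def bfloat16_bytes(x):
                    # bfloat16 keeps the top 16 bits of the float32 encoding;
                    # in little-endian order those are the last two bytes
                    return struct.pack("<f", x)[2:]

                print(fp16_bytes(1.5).hex())      # 003e
                print(bfloat16_bytes(1.5).hex())  # c03f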


            serg Sergei Golubchik added a comment -

            Right. Thanks!

