[MDEV-33410] VECTOR data type - Jira

Details

Type: New Feature
Status: Closed
Priority: Critical
Resolution: Fixed
Fix Version/s: 11.7.1
Component/s: Data types, Vector search
Labels:
None

Description

VECTOR(N) data type: an array of N floating-point numbers in some unspecified internal format (IEEE 754 or something else).

Questions:
Should we allow any particular operators on these?

Attachments

Issue Links

blocks

MDEV-35031 Update on vector column returns error but modifies the value, results in further ER_KEY_NOT_FOUND

Closed

MDEV-35831 MYSQL_TYPE_VECTOR

Open

relates to

MDEV-31053 UUID(size) should be disallowed

Confirmed

MDEV-32885 VEC_DISTANCE() function

Closed

MDEV-32886 VEC_FromText() and VEC_ToText() functions

Closed

MDEV-35063 Assertion `v->distance_to_target >= threshold' fails upon adding certain values to vector key

Closed

MDEV-32887 vector search

Stalled

Activity

Robert Dyas added a comment - 2024-07-20 22:12

I'm new to this, but it seems to me that a column should be defined as something like VECTOR(3072, 32) (3072 dimensions of 32-bit floats), where the 32-bit width is the default and is an optional parameter.

Vector values should be able to be input (insert/update) via JSON syntax, and when you SELECT on a vector the default output syntax seems like it should be JSON (for human-readable rendering).

Also, it might be helpful if you can specify a col as VECTOR(dim, bit_depth, model_name) so the column definition has an optional user defined model name (maybe varchar(60) ) associated with it. The vector is so tightly tied to the model that produced it (say openai-text-embedding-3-large) that the model is conceptually part of the datatype... i.e. you have to generate embeddings in the same model to match against that vector stored in the db. Things are evolving so quickly and models are changing... seems like it would be nice to be able to track the model that applies to those vectors. Up to the user to ensure they do of course. I can see using different embedding models for different use cases and over time... having this as part of the datatype will reduce confusion.

Please forgive me if any of what I've said is stupid, as I'm on a learning curve with all this stuff.

Sergei Golubchik added a comment - 2024-07-21 21:07

Maybe. For now we plan something like VECTOR(3072), where it's always 32-bit floats. But we can add the float width later, indeed.

As for the "model name" — the server doesn't call the model directly (yet) and it has no way of verifying what model has generated the embedding, so it cannot enforce the model. You can use COMMENT "generated by openai-text-embedding-3-large" — it'll be just as good, purely informational for you, the server cannot enforce it anyway.
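
As a rough illustration of what a VECTOR(3072) of 32-bit floats implies for storage, here is a minimal Python sketch. It assumes each component is a little-endian IEEE 754 float32 stored back to back; the ticket leaves the internal format unspecified, so this layout is an assumption, and `pack_vector` / `unpack_vector` are hypothetical helpers, not server functions.

```python
import struct

def pack_vector(values):
    """Pack floats as consecutive little-endian float32 values (assumed layout)."""
    return struct.pack(f"<{len(values)}f", *values)

def unpack_vector(blob):
    """Inverse of pack_vector: 4 bytes per component."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))

# A 3072-dimensional embedding, as in the VECTOR(3072) example.
embedding = [0.5] * 3072
blob = pack_vector(embedding)
size_bytes = len(blob)  # 3072 components * 4 bytes each
```

Under this assumed layout, one 3072-dimensional value takes 3072 * 4 = 12288 bytes, before any per-row overhead.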

Robert Dyas added a comment - 2024-07-21 21:22

Comment on column works.

Patrick Reynolds added a comment - 2024-07-22 17:58

> VECTOR(3072, 32)

Bit width alone isn't enough to specify a representation format. For example, FP16 and bfloat16 are both 16 bits, and int8 and BF8 are both 8 bits. The second, optional parameter on the vector column should probably be a string (or a keyword) like VECTOR(3072, "float32").
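
To make the FP16 versus bfloat16 point concrete, the following Python sketch decodes one 16-bit pattern under both formats. It relies only on the standard `struct` module; the helper names are hypothetical, and the bfloat16 decoder uses the fact that bfloat16 is the upper 16 bits of an IEEE 754 float32.

```python
import struct

def decode_fp16(bits):
    """Interpret 16 bits as IEEE 754 half precision (5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def decode_bfloat16(bits):
    """Interpret 16 bits as bfloat16 (8 exponent, 7 mantissa bits)
    by widening to float32 with zeroed low mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

pattern = 0x3F80
as_fp16 = decode_fp16(pattern)      # 1.875
as_bf16 = decode_bfloat16(pattern)  # 1.0
```

The same two bytes read as 1.875 in FP16 and 1.0 in bfloat16, so a bare width of 16 would not tell the server how to interpret stored values.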

Sergei Golubchik added a comment - 2024-07-22 19:01

Right. Thanks!

People

Assignee: Sergei Golubchik

Reporter: Sergei Golubchik

Votes: 4

Watchers: 16

Dates

Created: 2024-02-07 11:26

Updated: 2025-03-15 20:48

Resolved: 2024-11-06 13:58

MariaDB Server