Details
Type: New Feature
Status: Closed
Priority: Critical
Resolution: Fixed
Description
Add a VECTOR(N) data type: an array of N floating-point numbers stored in some unspecified internal format (IEEE 754 or something else).
Questions:
Should we allow any particular operators on these?
Issue Links
blocks
- MDEV-35031 Update on vector column returns error but modifies the value, results in further ER_KEY_NOT_FOUND (Closed)
- MDEV-35831 MYSQL_TYPE_VECTOR (Open)
relates to
- MDEV-31053 UUID(size) should be disallowed (Confirmed)
- MDEV-32885 VEC_DISTANCE() function (Closed)
- MDEV-32886 VEC_FromText() and VEC_ToText() functions (Closed)
- MDEV-35063 Assertion `v->distance_to_target >= threshold' fails upon adding certain values to vector key (Closed)
- MDEV-32887 vector search (Stalled)
Two proposals related to this:
First, we should standardize on a table-creation syntax. PlanetScale is using CREATE TABLE t(val VECTOR(4)), with the dimension being a required part of the type. We consider a vector a strong type, with a mandatory dimension, so any attempt to mix string data or vectors of different dimensions can be caught immediately.
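To illustrate what "strong type with a mandatory dimension" could mean in practice, here is a minimal Python sketch. The `VectorType` class and its `validate` method are hypothetical names made up for this example; they model the proposed check and are not taken from any server's code.

```python
from dataclasses import dataclass
from numbers import Real


@dataclass(frozen=True)
class VectorType:
    """Hypothetical model of a VECTOR(N) column type (illustration only)."""

    dimension: int  # the mandatory N in VECTOR(N)

    def validate(self, value):
        """Return the value as a list of floats, or raise if it does not fit."""
        # String or raw binary data is rejected outright instead of being coerced.
        if isinstance(value, (str, bytes, bytearray)):
            raise TypeError("string data is not a vector")
        values = list(value)
        # A vector of a different dimension is a type error for this column.
        if len(values) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} elements, got {len(values)}"
            )
        # Every element must be a real number (bool is excluded on purpose).
        for x in values:
            if isinstance(x, bool) or not isinstance(x, Real):
                raise TypeError(f"non-numeric element: {x!r}")
        return [float(x) for x in values]
```

With `col = VectorType(4)`, `col.validate([1, 2, 3, 4])` is accepted, while a three-element list or the string `"1,2,3,4"` is rejected at once rather than failing later during a distance computation.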
Second, the storage format should have a version so it is self-describing. This is what keeps people from accidentally inserting a binary blob for a vector of the wrong format. PlanetScale's chosen format is:
For version==1, the "data" field is an array of 32-bit IEEE 754 floats. Other versions could support other number formats such as bfloat16, compression and packing schemes, and/or dimensions larger than 65535. Every vector serialization format will start with a 16-bit version, but the other fields can vary by version.
Why an explicit dimension, when we know that the dimension is generally (blob_size - 4) / 4? First, it acts as a redundancy check against truncated data. Second, it allows for formats that involve compression, sparse storage, or float formats shorter than 8 bits. It is also a productive use of the otherwise empty space if you want the "data" field to be 32-bit aligned.
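The layout described above can be sketched in Python as follows. This is a reading of the prose, not the actual definition: it assumes a 4-byte header made of a 16-bit version followed by a 16-bit dimension, then the float data. The little-endian byte order, the field order, and the names `pack_vector` and `unpack_vector` are assumptions made for illustration.

```python
import struct

VERSION_FLOAT32 = 1
# Assumed header: little-endian, 16-bit version then 16-bit dimension (4 bytes).
_HEADER = struct.Struct("<HH")


def pack_vector(values):
    """Serialize a sequence of numbers as a version-1 blob (sketch)."""
    dim = len(values)
    if dim > 0xFFFF:
        raise ValueError("version 1 cannot describe more than 65535 dimensions")
    header = _HEADER.pack(VERSION_FLOAT32, dim)
    return header + struct.pack(f"<{dim}f", *values)


def unpack_vector(blob):
    """Parse a blob, checking the version and the redundant dimension field."""
    if len(blob) < 2:
        raise ValueError("blob too short to hold a version")
    # Only the leading version is common to all formats, so read it first.
    (version,) = struct.unpack_from("<H", blob, 0)
    if version != VERSION_FLOAT32:
        raise ValueError(f"unsupported vector format version {version}")
    if len(blob) < _HEADER.size:
        raise ValueError("blob too short to hold a version-1 header")
    _, dim = _HEADER.unpack_from(blob, 0)
    # The stored dimension must agree with the blob size: (blob_size - 4) / 4.
    if len(blob) != _HEADER.size + 4 * dim:
        raise ValueError("blob size does not match the stored dimension")
    return list(struct.unpack_from(f"<{dim}f", blob, _HEADER.size))
```

Under these assumptions a 4-dimensional vector occupies 20 bytes, the data starts at a 32-bit-aligned offset, and a blob that has lost its last float is rejected by the size check instead of being read as a shorter vector.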