Details

    Description

      For the first iteration of Vector search, we will implement the HNSW algorithm.

      The implementation will only support Euclidean distance initially.

      Basic plan:
      Graph construction will be done according to the HNSW paper.

      Storage-wise, we'll store the graph as part of a subtable (MDEV-33404).

      The table's definition will be something along these lines:

        CREATE TABLE i (
          level int unsigned not null,
          src varbinary(255) not null,
          dst varbinary(255) not null,
          index (level,src),
          index (level,dst));
      

      For each link in the graph, there will be a corresponding entry in the table.

      • src and dst will store handler::position, a quick link to the actual vector blob in the main table.

      The index (level,src) will allow for quick jumping between nodes.
      To go deeper in search, one just needs to decrement the level and search using the same "src" value.

      If src is found on level n, then it is also found on level n - 1 and so on. Level 0 is the base level with all the nodes.
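
      As an illustration only (the :lvl and :src placeholders are assumptions, and the server reaches the subtable through the handler API rather than literal SQL), walking the links of a node on one level and then descending could look like this:

        -- neighbours of the node referenced by :src on the current level
        SELECT dst FROM i WHERE level = :lvl AND src = :src;

        -- go one level deeper: decrement the level, keep the same src value
        SELECT dst FROM i WHERE level = :lvl - 1 AND src = :src;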

      Performance considerations:

      • Storing the vector in the subtable might be required. Looking up the blob value in the base table might be too costly.

          Activity

            cvicentiu Vicențiu Ciorbaru added a comment -

            Insert and search are now in a functioning state, although some refactoring is needed.

            wenhug Hugo Wen added a comment - - edited

            Hi cvicentiu, serg, I'm doing some research on the DELETE algorithm for the HNSW index and have summarized my current findings below.
            I'll try to create a PoC for option 3, but before diving deep into the implementation, I would like to seek early feedback on the feasibility of the options and any potential concerns regarding the preferred solution.

            In addition, should we create a separate Jira for the DELETE/UPDATE task?

            HNSW UPDATE/DELETE

            Many HNSW implementations do not support updating or deleting vectors. When users need to update or delete vectors, they have to recreate the whole index, which introduces very high costs.

            The original HNSW paper does not provide any information or guidance regarding updates or deletions.

            pgvector supports UPDATE/DELETE as summarized at the end of this comment.

            High-level Options:

            • Option 1: Mark graph nodes as source-invalid instead of deleting or rebuilding the graph index upon DELETE/UPDATE operations. These invalid nodes can still be used during search or insertion, but they will not be included in the results or added to the neighbor lists of new nodes.
              • pros: easier maintenance.
              • cons: index size continues to grow, and query speed and recall may degrade if too many updates/deletes occur.
            • Option 2: Upon DELETE/UPDATE operations, traverse all nodes in the graph and delete/recreate the related connections.
              • pros: minimal impact on recall, and the index is always up-to-date.
              • cons: extremely slow process, as it requires traversing all nodes in the graph and rebuilding them for complete cleanup.
            • Option 3: Combine Options 1 and 2. [preferred]
              Mark records as source-invalid and use them for search but exclude them from results. Update the index following Option 2 only when ANALYZE TABLE is run (investigate the possibility of triggering this cleanup during ANALYZE TABLE).
              • pros: minimizes performance impact, and users can update the index only when needed (during ANALYZE TABLE).
            • Option 4: Simply do not support index maintenance during DELETE/UPDATE operations. Require a complete index rebuild after UPDATE/DELETE.
            • Option 5: Mark graph nodes as source-invalid and simply skip all those nodes as if they do not exist during search or insertion.
              • this workaround will undoubtedly impact the recall.

            Implementation if option 3 above is selected:

            MariaDB does not have an existing way to identify whether an element in the index points to a valid or invalid record; MariaDB immediately updates the index when a record is updated or deleted.

            • To save the invalid state:
              Option 2 is preferred as it aligns with the high-level index mechanism by using the same secondary table without introducing too much complexity.
              • Option 1: Save another list of invalid references. This needs maintenance of an additional list and extra logic when searching.
              • Option 2: Add a column on the secondary table to mark the index records' state (invalid/valid).
            • On DELETE
              • load the secondary table
              • for each layer from layer_max down to 0, mark the corresponding nodes as source_invalid (see the sketch after this list).
            • On UPDATE
              • changes of the graph in the secondary table are similar to INSERT + DELETE
            • On SEARCH/INSERT
              • use all (invalid and valid) nodes for search.
              • do not add invalid nodes to the results list
              • do not count invalid nodes towards ef_search or ef_construct
              • do not add invalid nodes to the neighbor lists of new nodes
            • One possible trigger to start cleaning up all invalid nodes from the graph could be ANALYZE TABLE.
              • I haven't checked this much; it needs more investigation.
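
            A minimal SQL sketch of the marking approach above (the src_state column name and the :ref placeholder are assumptions; in the server this would go through the handler API rather than literal SQL):

              -- hypothetical state column on the secondary (graph) table
              ALTER TABLE i ADD COLUMN src_state TINYINT NOT NULL DEFAULT 0;  -- 0 = valid, 1 = source deleted

              -- on DELETE of the base-table row referenced by :ref,
              -- mark its nodes on every layer instead of removing them
              UPDATE i SET src_state = 1 WHERE src = :ref;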

            On DELETE in pgvector

            Term Explanations:

            heap: In PostgreSQL, the term "heap" refers to the main storage structure used by PostgreSQL to store the actual data of a table.
            TID: In PostgreSQL, TID stands for "Tuple ID". It is a low-level identifier that uniquely identifies a specific row (tuple) within a table's heap (storage).

            In the pgvector HNSW index, each element stores not only the TID of the data but also a copy of the vector value.

            When the DELETE query is executed:

            • The heap TID is marked as invalid so that a later index search can check whether the corresponding row is still valid.
            • But at that moment, nothing has changed in the index. I did not find an index interface called during the DELETE query either.

            When a row is deleted but before vacuum happens:

            • If a search or insert happens, it still uses those to-be-deleted elements in the index for the search, but does not add them to the returned results.
              • It uses the copy of the vector data in the index element to compare distances but does not read from the heap data.
              • It uses heaptidsLength to identify whether it is a valid record to be added to the final results or the ef counts. (I haven't figured out what triggers heaptidsLength to become 0 when the DELETE query happens.)

            When vacuum executes:

            • It traverses all index pages and collects a list of to-be-deleted nodes.
            • Then it traverses all index pages again and, for each node, checks whether its neighbor list contains any to-be-deleted nodes.
              • if no, skip it
              • if yes, empty the neighbor list and rebuild it, similar to a new insert

            serg Sergei Golubchik added a comment -

            Thanks, wenhug!

            I suggest we do the simple approach that doesn't take much time to implement and is known to work. After that we can improve, using the existing implementation as a baseline.

            So, let's start by doing option 1, marking deleted rows. This could be done by adding a new column to the table, like vector BLOB. It'll be empty for rows that are present in the table, and if a row is deleted, it'll store the vector that used to be in the table. These deleted nodes should be used normally by the algorithm for searches, except that they cannot be added to the result set.
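
            A sketch of that suggestion (illustrative SQL only; the :ref and :old_vec placeholders are assumptions, and the actual change would go through the handler API rather than literal SQL):

              -- extra column on the graph table, empty while the base row still exists
              ALTER TABLE i ADD COLUMN vector BLOB DEFAULT NULL;

              -- on DELETE: keep the node, but remember the vector it used to point to
              UPDATE i SET vector = :old_vec WHERE src = :ref;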

            wenhug Hugo Wen added a comment -

            Thanks for the suggestion, Sergei. I agree that starting with option 1 of marking deleted rows is a good way to go. It is not a one-way door; we can introduce the cleanup part of Option 3 later on.

            I'll start to add a new column for marking and saving the deleted vector. Now, I'm looking into how to load the secondary table during a DELETE operation.

            wenhug Hugo Wen added a comment -

            Hi serg, there's one issue with using the following secondary table and the vec blob column to store deleted values and identify whether the source was deleted.

              CREATE TABLE i (
                layer int not null,
                src varbinary(255) not null,           -- ref of the source
                neighbors varbinary(1000) not null,    -- ref of the neighbors
                vec blob default null,                 -- vector value of the source if source deleted
                index (layer, src))
            

            Currently, when retrieving neighbors during a search or insert operation, we get the references of all neighbors and then obtain their actual vector values by using source->file->ha_rnd_pos to read the source records directly.
            The logic needs to change because the source ref may be invalid if the record was deleted (correct me if I'm wrong, but I don't think MariaDB knows whether the position is still valid or not).
            The logic would be as follows (see the sketch below):

            • select vec from i where layer=0 and src=neigh_ref, using graph->file->ha_index_read_map in the code, and check if vec is null.
            • If vec is null, read the data from the primary table using ha_rnd_pos.
            • If vec is not null, the original data was deleted, and the stored value will be used for the calculation.
              This additional query will reduce the performance of normal search operations.
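
            A sketch of that lookup order (illustrative SQL; in the code this corresponds to graph->file->ha_index_read_map followed, only when needed, by source->file->ha_rnd_pos):

              -- probe the graph table first
              SELECT vec FROM i WHERE layer = 0 AND src = :neigh_ref;
              -- vec IS NULL     -> the source row still exists: read it via ha_rnd_pos on :neigh_ref
              -- vec IS NOT NULL -> the source row was deleted: use the stored copy for the distance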

            If we have to perform ha_index_read_map anyway, another option is to always save the vector value in the secondary table and include another column to mark the source state. For example:

              CREATE TABLE i (
                layer int not null,
                src varbinary(255) not null,           -- ref of the source
                neighbors varbinary(1000) not null,    -- ref of the neighbors
                vec blob default null,                 -- vector value of the source
                src_state tinyint default 0,           -- 0 if valid, 1 if source deleted
                index (layer, src))
            

            With this approach, a search would only need to access the index.

            • It also makes it possible to perform some preprocessing, like quantization, during the index build to improve the search performance.
            • However, my understanding is that ha_index_read_map could be slower than ha_rnd_pos, so this approach might degrade performance compared to the current implementation, which does not consider deletions.

            I was thinking of trying this approach and testing the performance change. What's your opinion? Is this approach worth a try, or do you have other suggestions?

            serg Sergei Golubchik added a comment - - edited

            It is definitely worth a try. I always thought it would be a useful tradeoff (time vs. space) to try. But first I thought we needed to establish a baseline to compare against. If you think the current code is a good baseline — sure, please go ahead and try it.

            One of the benefits of this structure is that the vector in the index doesn't have to be the same as in the table. It could be preprocessed, e.g. converted to smaller floats or reduced to fewer dimensions.

            wenhug Hugo Wen added a comment -

            Thank you serg for the quick feedback. I don't have a great baseline, but at least I have the previous implementation in my pull request that I can compare to. I'll test how it impacts performance.

            The benchmark for bb-11.4-vec-preview, which is the source of cvicentiu's pull request https://github.com/MariaDB/server/pull/3257, is not performing as expected for some reason (the insert is as slow as 2 records per second). So, at the moment, I can't use it as a baseline.


            One of the benefits of this structure is that the vector in the index doesn't have to be the same as in the table. It could be preprocessed, e.g. converted to smaller floats or reduced to fewer dimensions.

            Exactly. While adding the vector data introduces overhead and could impact performance, it has the potential to improve search speed with the normalized data.

            wenhug Hugo Wen added a comment -

            (Not related to the delete algorithm) I've rewritten the select_neighbours function to match Algorithm 4 from the paper, and I can now get very good recall with the benchmark tool.

                Found cached result
                  0:                                  MariaDB(m=16, ef_construction=64, ef_search=40)        1.000      232.829
                Found cached result
                  1:                                  MariaDB(m=50, ef_construction=10, ef_search=10)        0.998      272.387
            

            diff --git a/sql/vector_mhnsw.cc b/sql/vector_mhnsw.cc
            index 9768b3b6429..37f5bc9f553 100644
            --- a/sql/vector_mhnsw.cc
            +++ b/sql/vector_mhnsw.cc
            @@ -233,6 +233,61 @@ static bool select_neighbours(TABLE *source, TABLE *graph,
               return false;
             }
             
            +
            +static bool select_neighbours_heuristic(TABLE *source, TABLE *graph,
            +                              const FVector &target,
            +                              const List<FVector> &candidates,
            +                              size_t max_neighbour_connections,
            +                              List<FVector> *neighbours,
            +                              bool keep_pruned= false)
            +{
            +  /*
            +    TODO: If the input neighbours list is already sorted in search_layer, then
            +    no need to do additional queue build steps here.
            +   */
            +
            +  Queue<FVector, const FVector> pq;
            +  pq.init(candidates.elements, 0, 0, cmp_vec, &target);
            +
            +  List<FVector> pruned;
            +
            +  // TODO(cvicentiu) error checking.
            +  for (const auto &candidate : candidates)
            +    pq.push(&candidate);
            +
            +  neighbours->push_back(pq.pop());
            +
            +  while (pq.elements() && neighbours->elements < max_neighbour_connections)
            +  {
            +    FVector *e= pq.pop();
            +    bool selected= true;
            +    for (const auto &candidate : *neighbours)
            +    {
            +      if (e->distance_to(candidate) < e->distance_to(target))
            +      {
            +        selected= false;
            +        break;
            +      }
            +    }
            +
            +    if (!selected && keep_pruned)
            +      pruned.push_back(e);
            +    else if (selected)
            +      neighbours->push_back(e);
            +  }
            +
            +  if (keep_pruned)
            +  {
            +    while (pruned.elements && neighbours->elements < max_neighbour_connections)
            +    {
            +      neighbours->push_back(pruned.pop());
            +    }
            +  }
            +
            +  return false;
            +}
            +
            +
             /**
               Copy vector value to the records for future comparason of the deleted record, and indicating the source of the nodes are invalid.
             */
            


            serg Sergei Golubchik added a comment -

            This was a bit of duplication of work, unfortunately. I fixed it yesterday and pushed it into the corresponding 11.6 branch.

            wenhug Hugo Wen added a comment -

            > the corresponding 11.6 branch
            Is it https://github.com/MariaDB/server/tree/bb-11.6-MDEV-32887-vector ?

            Besides the logic fix, the differences between my select_neighbours implementation and the updated version in your branch are:

            1. I intentionally did not run EXTEND_CANDIDATES. It does not significantly improve recall but impacts the speed a lot. (This is the key issue in your branches that leads to very slow inserts with the benchmarking tool.)
            2. Another small improvement: pq_discard does not need initialization or data insertion if KEEP_PRUNED_CONNECTIONS is not set, and it does not need to be implemented as a queue since the elements were already sorted before insertion.

            serg Sergei Golubchik added a comment -

            1. Right, I kept it for completeness, but turned it off.
            2. This is a good point, thanks.

            bjquinn BJ Quinn added a comment -

            My apologies, I've googled this to death but have not found the answer. Is this feature available in the 11.6 preview release? If not, I saw some comments about developer preview releases by the end of May, but I can't seem to find a link to those. Or is building https://github.com/MariaDB/server/tree/bb-11.6-MDEV-32887-vector from source the correct approach? Thanks!


            serg Sergei Golubchik added a comment -

            It is not part of the 11.6 preview. There will be a separate preview with this feature only.

            At the moment you can indeed build bb-11.6-MDEV-32887-vector to see what's there. It lacks https://github.com/MariaDB/server/pull/3321 (support for updates and deletes) and MDEV-33413 (the cache exists in my private branch at the moment).

            Both missing features already exist in some form; they need to be merged into the bb-11.6-MDEV-32887-vector branch, and then we'll release a preview.

            bjquinn BJ Quinn added a comment -

            Got it, thanks! So I built bb-11.6-MDEV-32887-vector, and the build process seemed to go fine. In the log I can see at startup:

            2024-07-03 15:45:53 0 [Note] Starting MariaDB 11.6.0-MariaDB source revision 77a016686ec2a2617dd6489a756b1f9f11a78d9f as process 27924

            Which seems to be the latest commit on that branch as far as I can tell, so it looks like I've gotten the proper source. But when I run "ALTER TABLE data ADD COLUMN embedding VECTOR(100);", I get "SQL Error (4161): Unknown data type: 'VECTOR'". Is there something else I need to enable to test?


            serg Sergei Golubchik added a comment -

            No, nothing. The VECTOR data type is MDEV-33410, and it's open; no work has been done on it yet.

            We're going to implement it, of course, but it's not the first priority — it's a convenience feature that helps to avoid mistakes, but an application does not really need it; one can store and search embeddings without a dedicated data type. We're prioritizing features that an application cannot work without. Functions VEC_FromText() and VEC_AsText() are also not a priority.

            See the test file mysql-test/main/vector.test — that's how one uses it now: store in a blob, insert as binary.
            In Python I do it like:

            c.execute('INSERT kb (emb) VALUES (?)', (array.array('f', resp.data.embedding).tobytes(),))
            

            bjquinn BJ Quinn added a comment -

            Great, that works, so I'll start testing my real workload against it. Thanks!


            serg Sergei Golubchik added a comment -

            A word of caution: most performance optimizations, even if implemented, haven't been pushed into this branch yet.

            wenhug Hugo Wen added a comment - - edited

            Hi serg, I've summarized some findings regarding the scalar quantization using _Float16 that we discussed during our meeting today.

            Draft commit to test _Float16 (2-byte float) in HNSW index: https://github.com/HugoWenTD/server/commit/9656b6c0d

            Benchmarks indicate that using _Float16 instead of floats results in a 40-60% reduction in insertion speed and a 15-20% reduction in search speed. There is also a minor decrease in recall of less than 1%.

            There are two issues with this solution (more research needed):

            1. Converting 4-byte floats to 2-byte floats results in precision loss and a reduced range. Proportional scaling is necessary, but there is no simple method to define a proportion that works for all cases. Scaling must be done during the transformation, and the best approach depends on the specific dataset and range of values (see the note after this list).
              • The _Float16 range is -65504 to 65504.
              • If the original floats and squared distances are all within this range, the direct transform from float to _Float16 works perfectly, e.g. [1, 2, 222], [0, -1, 0].
              • However, if the original floats or squared distances are out of range, scaling must be done during the transformation; otherwise the distances make no sense at all, as they overflow, e.g. [6789, 1234], [-6789, -1234].
                • For the mnist-784-euclidean dataset, without scaling, the distances are bigger than FLT16_MAX and recall is 0. If the floats are divided by 1000 during the transformation, the recall becomes 0.978.
              • One possible solution could be to allow configuring a "proportion" parameter when scalar quantization is chosen, which would let users specify the appropriate scaling factor for their specific use case.
              • Another possible solution might be to define a corresponding data type (half-vector) and let users do the scaling before inserting the data.
            2. _Float16 requires CPU instruction set support; otherwise it falls back to float and does not utilize SIMD, leading to performance issues. In the commit I'm using -mf16c, but it looks like it could be improved further.
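
            A short note on why a uniform scale factor preserves recall (a basic property of Euclidean distance, not specific to this patch): dividing every coordinate by the same factor s > 0 scales all squared distances by 1/s², so the relative ordering of candidates is unchanged and only the overflow behaviour differs:

              d^2\left(\frac{x}{s}, \frac{y}{s}\right) = \sum_i \left(\frac{x_i - y_i}{s}\right)^2 = \frac{1}{s^2}\, d^2(x, y)

            Choosing s so that the scaled coordinates and squared distances stay below FLT16_MAX ≈ 65504 avoids the overflow that produced zero recall on mnist-784-euclidean without scaling.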
